Closed Bug 937220 Opened 11 years ago Closed 10 years ago

crash in mozalloc_abort(char const* const) | NS_DebugBreak (###!!! ABORT: cycle collector fault: file e:/builds/moz2_slave/m-cen-w64-ntly-000000000000000/build/xpcom/base/nsCycleCollector.cpp, line 1054)

Categories

Product: Core
Component: DOM: Workers
Type: defect
Version: 28 Branch
Platform: x86, All
Priority: Not set
Severity: critical

Tracking

RESOLVED DUPLICATE of bug 956284
Tracking Status
firefox27 --- unaffected
firefox28 --- affected
firefox29 --- affected
firefox-esr24 --- unaffected

People

(Reporter: tracy, Assigned: khuey)

Details

(4 keywords, Whiteboard: [most likely fixed by bug 956284])

Attachments

(2 files)

This bug was filed from the Socorro interface and is 
report bp-321d7a93-a8f5-41da-a53a-d275b2131111.
=============================================================

This increased in volume on 20131107 and has been ramping up in volume since then.


0 	mozalloc.dll 	mozalloc_abort(char const * const) 	memory/mozalloc/mozalloc_abort.cpp
1 	xul.dll 	NS_DebugBreak 	xpcom/base/nsDebugImpl.cpp
Are there reports with any information further up the stack?  The first two frames are really generic.
No, I clicked into a couple dozen random reports and they all look the same.
What we do have is an abort message in the App Notes:

xpcom_runtime_abort(###!!! ABORT: cycle collector fault: file e:/builds/moz2_slave/m-cen-w64-ntly-000000000000000/build/xpcom/base/nsCycleCollector.cpp, line 1054)
Summary: crash in mozalloc_abort(char const* const) | NS_DebugBreak → crash in mozalloc_abort(char const* const) | NS_DebugBreak (###!!! ABORT: cycle collector fault: file e:/builds/moz2_slave/m-cen-w64-ntly-000000000000000/build/xpcom/base/nsCycleCollector.cpp, line 1054)
That's quite peculiar.  I'm not sure why it isn't printing out a more useful message.
Given that cycle collector fault and that it rose up on Nightly at the same time as bug 937191, is this another regression from bug 928312?
Does the file path mean this is a Win64 crash, or is this just built on a Win64 machine or something?
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #5)
> Given that cycle collector fault and that it rose up on Nightly at the same
> time as bug 937191, is this another regression from bug 928312?

That sounds plausible, but we don't have much to go on here.
If I'm reading the crash-stats report correctly, 100% of these 233 crashes are on AMD, which is unusual.
Also of note, I looked at half a dozen crashes and none of them were on the main thread, which is evidence in favor of comment 5.
This is a win64 crash: you can tell by looking for "Build Architecture" in the crash report.

I'm looking into whether this is win64-only.
https://crash-stats.mozilla.com/search/?app_notes=cycle&app_notes=collector&app_notes=fault&version=28.0a1&_facets=version&_facets=signature&_facets=cpu_name&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=cpu_info&_columns=cpu_name shows that the recent spike is almost entirely win64, but multiple computers (not a single installation crashing repeatedly). Cycle collector aborts across other versions are distributed normally.
I'm seeing both Intel and AMD CPUIDs in here, so it doesn't look manufacturer-specific. The "amd64" field just means Win64. What's misleading is that on 64-bit processes, we don't get a "GenuineIntel" or "AuthenticAMD" label in front of the Family/Model/Stepping notes. This seems to be a general problem, not specific to our code; windbg's !cpuid extension seems to suffer from the same problem.

I wonder if this is actually a Win64-only crash or if the stackwalking just happened to make it so that only Win64 builds hit this particular variant of this signature. There appear to be some odd frames between abort and NS_DebugBreak.
> I wonder if this is actually a Win64-only crash or if the stackwalking just
> happened to make it so that only Win64 builds hit this particular variant of
> this signature

I'm pretty certain that this is win64-only. The crash-stats search I linked wasn't searching by signature, but was searching for "cycle collector fault" in the abort message, which should catch all the variants.
Pointing the debugger at a local copy of the xul image somehow lets it figure out the stack better. It's an overflowing refcount in DescribeRefCountedNode.

xul!NS_DebugBreak
xul!Fault
xul!GCGraphBuilder::DescribeRefCountedNode
xul!nsDOMEventTargetHelper::cycleCollection::Traverse
xul!mozilla::dom::workers::WorkerPrivateParent<mozilla::dom::workers::WorkerPrivate>::cycleCollection::Traverse
xul!mozilla::dom::workers::XMLHttpRequest::cycleCollection::Traverse
xul!GCGraphBuilder::Traverse
xul!nsCycleCollector::MarkRoots
xul!nsCycleCollector::BeginCollection
xul!nsCycleCollector::Collect
xul!nsCycleCollector_collect
xul!`anonymous namespace'::WorkerJSRuntime::CustomGCCallback
mozjs!Collect
mozjs!js::GC
mozjs!js::DestroyContext
xul!`anonymous namespace'::WorkerThreadRunnable::Run
xul!nsThread::ProcessNextEvent
xul!NS_ProcessNextEvent
xul!nsThread::ThreadFunc
nss3!_PR_NativeRunThread
nss3!pr_root
msvcr100!_callthreadstartex
I also see the same stack as comment 14 in the signatures containing js::frontend::Parser<js::frontend::FullParseHandler>::noteNameUse. noteNameUse disappears from the stack after the debugger gets to see the full image.
Thanks for the stack!

The faults in that method are:
    if (refCount == 0)
        Fault("zero refcount", mCurrPi);
    if (refCount == UINT32_MAX)
        Fault("overflowing refcount", mCurrPi);

Presumably it is the first one.
Blocks: 928312
(In reply to Andrew McCreight [:mccr8] from comment #16)
> Thanks for the stack!
> 
> The faults in that method are:
>     if (refCount == 0)
>         Fault("zero refcount", mCurrPi);
>     if (refCount == UINT32_MAX)
>         Fault("overflowing refcount", mCurrPi);
> 
> Presumably it is the first one.

Based on the return address and the string pushed, it looks like the second one (overflow).
Weird.
This is almost certainly something horrible.
Group: core-security
So it sounds like our refcount is underflowing.  Fun.

I had an idea of how to rewrite WorkerPrivate to have less crazy ownership.  Maybe I should just do that.
Keywords: sec-high
Crash Signature: [@ mozalloc_abort(char const* const) | NS_DebugBreak] → [@ mozalloc_abort(char const* const) | NS_DebugBreak] [@ mozalloc_abort(char const* const) | NS_DebugBreak | js::AtomizeChars(js::ExclusiveContext*, wchar_t const*, unsigned __int64, js::InternBehavior)]
This continues to be a topcrash on trunk, even if now we get a somewhat random and pretty bogus third frame in the signature (the most-common signature of those has been added here, a few others are floating around as well).
Kyle, can you look into this?  Thanks.
Flags: needinfo?(khuey)
Yeah, I will dive in next week.
Assignee: nobody → khuey
Flags: needinfo?(khuey)
Crash Signature: [@ mozalloc_abort(char const* const) | NS_DebugBreak] [@ mozalloc_abort(char const* const) | NS_DebugBreak | js::AtomizeChars(js::ExclusiveContext*, wchar_t const*, unsigned __int64, js::InternBehavior)] → [@ mozalloc_abort(char const* const) | NS_DebugBreak] [@ mozalloc_abort(char const* const) | NS_DebugBreak | js::AtomizeChars(js::ExclusiveContext*, wchar_t const*, unsigned __int64, js::InternBehavior)] [@ mozalloc_abort(char const*) | Abort | NS_Debug…
Flags: needinfo?(khuey)
Macs are crashing with:

###!!! ABORT: cycle collector fault: file ../../../../xpcom/base/nsCycleCollector.cpp, line 1155)

Crash Reason 	EXC_BAD_ACCESS / KERN_INVALID_ADDRESS

[@ mozalloc_abort(char const*) | Abort | NS_DebugBreak | GCGraphBuilder::DescribeRefCountedNode(unsigned int, char const*) ]
Component: XPCOM → DOM: Workers
Flags: needinfo?(khuey)
We will go ahead and track this considering the impact to users.
This is a regression from bug 928312.  It doesn't affect anything before 28.
Blocks: 956284
It's been over a week, can we get a status update on this topcrash?
Flags: needinfo?(khuey)
Well I've been hoping we would get a testcase in bug 959562.
Flags: needinfo?(khuey)
smacleod has a test case for this, it sounds like.  Hurray!
<smacleod>	 mccr8: I'm having an issue when I start a worker up at onquitapplication. The browser crashes with Fault in cycle collector: overflowing refcount (ptr: 0x10af2e360)
I can reproduce with a pretty high frequency on my OSX 10.9.1 machine. I've attached the patch I'm using which causes the error at shutdown.

STR:

- Create a profile with session store set to automatically restore.

- Apply Patch and build (Currently applied to 03070649278e65e31fe9452a259730c7370d177b commit on gecko-dev repository)

- Run the browser
- Close the browser by cmd+q hotkey
- Browser will print timestamp from js inside worker
- Browser will do one of four things:

1. Close and exit process (Usually without properly writing sessionstore.js)

2. Window will close, but process locks up without printing

3. Window will close, but process locks up and prints:
###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID

4. Process prints to console and crashes:
Fault in cycle collector: overflowing refcount (ptr: 0x10b63d800)
[95490] ###!!! ABORT: cycle collector fault: file /Users/smacleod/src/moz/gecko-dev/xpcom/base/nsCycleCollector.cpp, line 1254
[95490] ###!!! ABORT: cycle collector fault: file /Users/smacleod/src/moz/gecko-dev/xpcom/base/nsCycleCollector.cpp, line 1254

The frequency of the fourth case seems to depend on the sessionstore.js file, which is what the worker is writing when things crash. Some sessions cause the frequency to go up; I will attach a session that causes it occasionally, but not as often as I've previously seen. I suspect there might be a sweet spot of string size sent to the worker to be written.

It should be noted that if you allow 15s to elapse after starting the browser the worker will start before shutdown, and there will be no crash when shutdown occurs.
Blocks: 959130
With Steven's STR the refcount seems to always overflow for a nsDOMEventTargetHelper. Not sure if that's helpful information.
Just a note here: We are experiencing a lot of those crashes in our Mozmill automation for Firefox 28 and 29 throughout the day. See bug 959562 for more details.

I'm a bit worried that this will have a larger impact on QA once the merge from Aurora to Beta happens. If tests get aborted due to those crashes, there will be a delay in the sign-off process. So is there anything we can do to get this crasher fixed by next Monday?
This should be fixed by bug 956284.
No longer blocks: 956284
Depends on: 956284
Whiteboard: [most likely fixed by bug 956284]
Steven, could you please test today's upcoming Nightly build to see whether you can still reproduce this problem? If not, bug 956284 really fixed it.
Flags: needinfo?(smacleod)
(In reply to Henrik Skupin (:whimboo) from comment #37)
> Steven, could you please test the upcoming Nightly build from today if you
> can still reproduce this problem? If not bug 956284 really fixed it.

I'm unable to reproduce the cycle collector crash now.

I still get lockups occasionally in this situation, but I think that's just related to Bug 964531
Flags: needinfo?(smacleod)
(In reply to Steven MacLeod [:smacleod] from comment #38)
> I still get lockups occasionally in this situation, but I think that's just
> related to Bug 964531

Investigated the lock ups in bug 965309.
Status: NEW → RESOLVED
Closed: 10 years ago
No longer depends on: 956284
Resolution: --- → DUPLICATE
Thanks again for the test case, Steven!
Andrew, is the testcase something we could add to one of our test suites?
Flags: in-testsuite?
Bug 959562 suggests you have a working test case already ;) But seriously it's a rather intermittent issue caused by spawning workers on shutdown. Might be hard/impossible to reproduce reliably in a test.
We haven't created a minimized testcase for bug 959562, and it also fails only intermittently. So I was hoping the testcase on this bug would make it easier to reproduce reliably.
Yeah, I don't know if there's a test.  Presumably once Steven's stuff lands, things will fail if we regress this, so that's something.
Group: core-security → core-security-release
Group: core-security-release