937220 - crash in mozalloc_abort(char const* const) | NS_DebugBreak (###!!! ABORT: cycle collector fault: file e:/builds/moz2_slave/m-cen-w64-ntly-000000000000000/build/xpcom/base/nsCycleCollector.cpp, line 1054)

Reporter

Description

•

11 years ago

This bug was filed from the Socorro interface and is 
report bp-321d7a93-a8f5-41da-a53a-d275b2131111.
=============================================================

This increased in volume on 20131107 and has been ramping up in volume since then.


0 	mozalloc.dll 	mozalloc_abort(char const * const) 	memory/mozalloc/mozalloc_abort.cpp
1 	xul.dll 	NS_DebugBreak 	xpcom/base/nsDebugImpl.cpp

Andrew McCreight [:mccr8]

Comment 1

•

11 years ago

Are there reports with any information further up the stack?  The first two frames are really generic.

Tracy Walker [:tracy]

Reporter

Comment 2

•

11 years ago

No, I clicked into a couple dozen random reports and they all look the same.

Robert Kaiser

Comment 3

•

11 years ago

What we do have is an abort message in the App Notes:

xpcom_runtime_abort(###!!! ABORT: cycle collector fault: file e:/builds/moz2_slave/m-cen-w64-ntly-000000000000000/build/xpcom/base/nsCycleCollector.cpp, line 1054)

Summary: crash in mozalloc_abort(char const* const) | NS_DebugBreak → crash in mozalloc_abort(char const* const) | NS_DebugBreak (###!!! ABORT: cycle collector fault: file e:/builds/moz2_slave/m-cen-w64-ntly-000000000000000/build/xpcom/base/nsCycleCollector.cpp, line 1054)

Andrew McCreight [:mccr8]

Comment 4

•

11 years ago

That's quite peculiar.  I'm not sure why it isn't printing out a more useful message.

Robert Kaiser

Comment 5

•

11 years ago

Given that cycle collector fault and that it rose up on Nightly at the same time as bug 937191, is this another regression from bug 928312?

Andrew McCreight [:mccr8]

Comment 6

•

11 years ago

Does the file path mean this is a Win64 crash, or is this just built on a Win64 machine or something?

Andrew McCreight [:mccr8]

Comment 7

•

11 years ago

(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #5)
> Given that cycle collector fault and that it rose up on Nightly at the same
> time as bug 937191, is this another regression from bug 928312?

That sounds plausible, but we don't have much to go on here.

Andrew McCreight [:mccr8]

Comment 8

•

11 years ago

If I'm reading the crash-stats report correctly, 100% of these 233 crashes are on AMD, which is unusual.

Andrew McCreight [:mccr8]

Comment 9

•

11 years ago

Also of note, I looked at half a dozen crashes and none of them were on the main thread, which is evidence in favor of comment 5.

Benjamin Smedberg

Comment 10

•

11 years ago

This is a win64 crash: you can tell by looking for "Build Architecture" in the crash report.

I'm looking into whether this is win64-only.

Benjamin Smedberg

Comment 11

•

11 years ago

https://crash-stats.mozilla.com/search/?app_notes=cycle&app_notes=collector&app_notes=fault&version=28.0a1&_facets=version&_facets=signature&_facets=cpu_name&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=cpu_info&_columns=cpu_name shows that the recent spike is almost entirely win64, but multiple computers (not a single installation crashing repeatedly). Cycle collector aborts across other versions are distributed normally.

(Away)

Comment 12

•

11 years ago

I'm seeing both Intel and AMD CPUIDs in here, so it doesn't look manufacturer-specific. The "amd64" field just means Win64. What's misleading is that on 64-bit processes, we don't get a "GenuineIntel" or "AuthenticAMD" label in front of the Family/Model/Stepping notes. This seems to be a general problem, not specific to our code; windbg's !cpuid extension seems to suffer the same problem.
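For context on where the "GenuineIntel" / "AuthenticAMD" label comes from: CPUID leaf 0 returns the 12-byte vendor string split across the EBX, EDX and ECX registers. The sketch below only decodes register values passed in as arguments; it does not execute CPUID, and the function names are hypothetical and not taken from the crash reporter.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Appends the four bytes of a register value, least significant byte first.
static void AppendRegisterBytes(std::string& out, uint32_t reg) {
    for (int shift = 0; shift < 32; shift += 8)
        out.push_back(static_cast<char>((reg >> shift) & 0xFF));
}

// Hypothetical helper: rebuilds the vendor label from CPUID leaf 0 output.
// The string is laid out in EBX, then EDX, then ECX (note the order).
std::string VendorFromCpuidLeaf0(uint32_t ebx, uint32_t edx, uint32_t ecx) {
    std::string vendor;
    AppendRegisterBytes(vendor, ebx);
    AppendRegisterBytes(vendor, edx);
    AppendRegisterBytes(vendor, ecx);
    assert(vendor.size() == 12);  // three registers, four bytes each
    return vendor;
}
```

Calling `VendorFromCpuidLeaf0(0x756E6547, 0x49656E69, 0x6C65746E)` yields "GenuineIntel"; a report that carries only Family/Model/Stepping is missing exactly this piece.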

I wonder if this is actually a Win64-only crash or if the stackwalking just happened to make it so that only Win64 builds hit this particular variant of this signature. There appear to be some odd frames between abort and NS_DebugBreak.

Benjamin Smedberg

Comment 13

•

11 years ago

> I wonder if this is actually a Win64-only crash or if the stackwalking just
> happened to make it so that only Win64 builds hit this particular variant of
> this signature

I'm pretty certain that this is win64-only. The crash-stats search I linked wasn't searching by signature, but was searching for "cycle collector fault" in the abort message, which should catch all the variants.
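A rough sketch of why matching on the abort message catches every variant while matching on one signature does not. The `CrashRecord` type and the sample records are made up for illustration; only the signature strings and the abort text are copied from this bug.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified crash record; real reports carry many more fields.
struct CrashRecord {
    std::string signature;
    std::string appNotes;
};

// Counts reports whose signature matches exactly.
std::size_t CountBySignature(const std::vector<CrashRecord>& crashes,
                             const std::string& signature) {
    std::size_t count = 0;
    for (const CrashRecord& crash : crashes) {
        if (crash.signature == signature)
            ++count;
    }
    return count;
}

// Counts reports whose app notes contain the given text anywhere.
std::size_t CountByAppNotes(const std::vector<CrashRecord>& crashes,
                            const std::string& needle) {
    assert(!needle.empty());
    std::size_t count = 0;
    for (const CrashRecord& crash : crashes) {
        if (crash.appNotes.find(needle) != std::string::npos)
            ++count;
    }
    return count;
}

// Made-up sample: one abort message filed under three different signatures.
std::vector<CrashRecord> SampleCrashes() {
    const std::string note = "###!!! ABORT: cycle collector fault";
    return {
        {"mozalloc_abort(char const* const) | NS_DebugBreak", note},
        {"mozalloc_abort(char const* const) | NS_DebugBreak | "
         "js::AtomizeChars(js::ExclusiveContext*, wchar_t const*, "
         "unsigned __int64, js::InternBehavior)", note},
        {"mozalloc_abort(char const*) | Abort | NS_DebugBreak | "
         "GCGraphBuilder::DescribeRefCountedNode(unsigned int, char const*)", note},
    };
}
```

On this sample, the exact-signature count for the two-frame signature is 1, while the app-notes count for "cycle collector fault" is 3.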

(Away)

Comment 14

•

11 years ago

Pointing the debugger at a local copy of the xul image somehow lets it figure out the stack better. It's an overflowing refcount in DescribeRefCountedNode.

xul!NS_DebugBreak
xul!Fault
xul!GCGraphBuilder::DescribeRefCountedNode
xul!nsDOMEventTargetHelper::cycleCollection::Traverse
xul!mozilla::dom::workers::WorkerPrivateParent<mozilla::dom::workers::WorkerPrivate>::cycleCollection::Traverse
xul!mozilla::dom::workers::XMLHttpRequest::cycleCollection::Traverse
xul!GCGraphBuilder::Traverse
xul!nsCycleCollector::MarkRoots
xul!nsCycleCollector::BeginCollection
xul!nsCycleCollector::Collect
xul!nsCycleCollector_collect
xul!`anonymous namespace'::WorkerJSRuntime::CustomGCCallback
mozjs!Collect
mozjs!js::GC
mozjs!js::DestroyContext
xul!`anonymous namespace'::WorkerThreadRunnable::Run
xul!nsThread::ProcessNextEvent
xul!NS_ProcessNextEvent
xul!nsThread::ThreadFunc
nss3!_PR_NativeRunThread
nss3!pr_root
msvcr100!_callthreadstartex

(Away)

Comment 15

•

11 years ago

I also see the same stack as comment 14 in the signatures containing js::frontend::Parser<js::frontend::FullParseHandler>::noteNameUse. noteNameUse disappears from the stack after the debugger gets to see the full image.

Andrew McCreight [:mccr8]

Comment 16

•

11 years ago

Thanks for the stack!

The faults in that method are:
    if (refCount == 0)
        Fault("zero refcount", mCurrPi);
    if (refCount == UINT32_MAX)
        Fault("overflowing refcount", mCurrPi);

Presumably it is the first one.

Blocks: 928312

(Away)

Comment 17

•

11 years ago

(In reply to Andrew McCreight [:mccr8] from comment #16)
> Thanks for the stack!
> 
> The faults in that method are:
>     if (refCount == 0)
>         Fault("zero refcount", mCurrPi);
>     if (refCount == UINT32_MAX)
>         Fault("overflowing refcount", mCurrPi);
> 
> Presumably it is the first one.

Based on the return address and the string pushed, it looks like the second one (overflow).

Andrew McCreight [:mccr8]

Comment 18

•

11 years ago

Weird.

Andrew McCreight [:mccr8]

Comment 19

•

11 years ago

This is almost certainly something horrible.

Group: core-security

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Assignee

Comment 20

•

11 years ago

So it sounds like our refcount is underflowing.  Fun.

I had an idea of how to rewrite WorkerPrivate to have less crazy ownership.  Maybe I should just do that.
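A minimal sketch of how an underflow ends up reported as "overflowing refcount", assuming the count is a plain unsigned 32-bit integer. `ToyRefCounted` is a hypothetical stand-in, not the real Gecko refcounting code; the two checks mirror the ones quoted in comment 16 but return the message instead of calling Fault.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical stand-in for a refcounted object with a 32-bit count.
struct ToyRefCounted {
    uint32_t refCount = 1;
    void AddRef() { ++refCount; }
    void Release() { --refCount; }  // no guard against releasing at zero
};

// Same two conditions as the checks quoted in comment 16.
std::string ClassifyRefCount(uint32_t refCount) {
    if (refCount == 0)
        return "zero refcount";
    if (refCount == std::numeric_limits<uint32_t>::max())
        return "overflowing refcount";
    return "";
}

// One Release() too many: 1 -> 0 -> wraps around to UINT32_MAX.
std::string OverReleaseOnce() {
    ToyRefCounted obj;
    obj.Release();               // balanced release: 1 -> 0
    assert(obj.refCount == 0);
    obj.Release();               // unbalanced release: unsigned wraparound
    return ClassifyRefCount(obj.refCount);
}
```

Unsigned arithmetic wraps modulo 2^32, so a single extra release on a count of zero produces UINT32_MAX, which trips the second check rather than the first.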

Daniel Veditz [:dveditz]

Updated

•

11 years ago

Keywords: sec-high

Tracy Walker [:tracy]

Reporter

Updated

•

11 years ago

Crash Signature: [@ mozalloc_abort(char const* const) | NS_DebugBreak] → [@ mozalloc_abort(char const* const) | NS_DebugBreak] [@ mozalloc_abort(char const* const) | NS_DebugBreak | js::AtomizeChars(js::ExclusiveContext*, wchar_t const*, unsigned __int64, js::InternBehavior)]

Robert Kaiser

Comment 22

•

11 years ago

This continues to be a topcrash on trunk, even if now we get a somewhat random and pretty bogus third frame in the signature (the most-common signature of those has been added here, a few others are floating around as well).

tracking-firefox28: --- → ?

Andrew McCreight [:mccr8]

Comment 23

•

11 years ago

Kyle, can you look into this?  Thanks.

Flags: needinfo?(khuey)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Assignee

Comment 24

•

11 years ago

Yeah, I will dive in next week.

Assignee: nobody → khuey

Flags: needinfo?(khuey)

Tracy Walker [:tracy]

Reporter

Updated

•

11 years ago

Crash Signature: [@ mozalloc_abort(char const* const) | NS_DebugBreak] [@ mozalloc_abort(char const* const) | NS_DebugBreak | js::AtomizeChars(js::ExclusiveContext*, wchar_t const*, unsigned __int64, js::InternBehavior)] → [@ mozalloc_abort(char const* const) | NS_DebugBreak] [@ mozalloc_abort(char const* const) | NS_DebugBreak | js::AtomizeChars(js::ExclusiveContext*, wchar_t const*, unsigned __int64, js::InternBehavior)] [@ mozalloc_abort(char const*) | Abort | NS_Debug…

Liz Henry (:lizzard) (relman/hg->git project)

Updated

•

11 years ago

status-firefox28: --- → affected

Doug Turner (:dougt)

Updated

•

11 years ago

Flags: needinfo?(khuey)

Tracy Walker [:tracy]

Reporter

Comment 25

•

11 years ago

Macs are crashing with:

###!!! ABORT: cycle collector fault: file ../../../../xpcom/base/nsCycleCollector.cpp, line 1155)

Crash Reason 	EXC_BAD_ACCESS / KERN_INVALID_ADDRESS

[@ mozalloc_abort(char const*) | Abort | NS_DebugBreak | GCGraphBuilder::DescribeRefCountedNode(unsigned int, char const*) ]

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Assignee

Updated

•

11 years ago

Component: XPCOM → DOM: Workers

Flags: needinfo?(khuey)

Benjamin Kerensa [:bkerensa]

Comment 26

•

11 years ago

We will go ahead and track this considering the impact to users.

tracking-firefox28: ? → +

Al Billings [:abillings - ex-MoCo]

Updated

•

11 years ago

status-firefox27: --- → ?

status-firefox29: --- → affected

status-firefox-esr24: --- → ?

tracking-firefox29: --- → +

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Assignee

Comment 27

•

11 years ago

This is a regression from bug 928312.  It doesn't affect anything before 28.

status-firefox27: ? → unaffected

status-firefox-esr24: ? → unaffected

Andrew McCreight [:mccr8]

Updated

•

11 years ago

Blocks: 956284

u279076

Comment 28

•

10 years ago

It's been over a week, can we get a status update on this topcrash?

Flags: needinfo?(khuey)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Assignee

Comment 29

•

10 years ago

Well I've been hoping we would get a testcase in bug 959562.

Flags: needinfo?(khuey)

Henrik Skupin [:whimboo][⌚️UTC+1]

Updated

•

10 years ago

Comment 30

•

10 years ago

smacleod has a test case for this, it sounds like.  Hurray!

Andrew McCreight [:mccr8]

Comment 31

•

10 years ago

<smacleod>	 mccr8: I'm having an issue when I start a worker up at onquitapplication. The browser crashes with Fault in cycle collector: overflowing refcount (ptr: 0x10af2e360)

Steven MacLeod [:smacleod]

Comment 32

•

10 years ago

Attached patch: Patch - Used for STR

I can reproduce with a pretty high frequency on my OSX 10.9.1 machine. I've attached the patch I'm using which causes the error at shutdown.

STR:

- Create a profile with session store set to automatically restore.

- Apply Patch and build (Currently applied to 03070649278e65e31fe9452a259730c7370d177b commit on gecko-dev repository)

- Run the browser
- Close the browser by cmd+q hotkey
- Browser will print timestamp from js inside worker
- Browser will do one of four things:

1. Close and exit process (Usually without properly writing sessionstore.js)

2. Window will close, but process locks up without printing

3. Window will close, but process locks up and prints:
###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID

4. Process prints to console and crashes:
Fault in cycle collector: overflowing refcount (ptr: 0x10b63d800)
[95490] ###!!! ABORT: cycle collector fault: file /Users/smacleod/src/moz/gecko-dev/xpcom/base/nsCycleCollector.cpp, line 1254
[95490] ###!!! ABORT: cycle collector fault: file /Users/smacleod/src/moz/gecko-dev/xpcom/base/nsCycleCollector.cpp, line 1254

The frequency of the fourth case seems to depend on the sessionstore.js file, which is what the worker is writing when things crash. Some sessions will cause the frequency to go up. I will attach a session which causes it occasionally, but not as often as I've previously seen. I suspect there might be a sweet spot of string size sent to the worker to be written.

It should be noted that if you allow 15s to elapse after starting the browser the worker will start before shutdown, and there will be no crash when shutdown occurs.

Steven MacLeod [:smacleod]

Comment 33

•

10 years ago

Attached file: sessionstore.js for STR

Steven MacLeod [:smacleod]

Updated

•

10 years ago

Blocks: 959130

Tim Taubert [:ttaubert] (inactive)

Comment 34

•

10 years ago

With Steven's STR the refcount seems to always overflow for a nsDOMEventTargetHelper. Not sure if that's helpful information.

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 35

•

10 years ago

Just a note here: We are experiencing a lot of those crashes in our Mozmill automation for Firefox 28 and 29 throughout the day. See bug 959562 for more details.

I'm a bit worried that this will have a larger impact for QA once the merge from Aurora to Beta has happened. If tests are getting aborted due to those crashes, there will be a delay in the sign-off process. So is there anything we can do to get this crasher fixed by next Monday?

Tim Taubert [:ttaubert] (inactive)

Comment 36

•

10 years ago

This should be fixed by bug 956284.

Henrik Skupin [:whimboo][⌚️UTC+1]

Updated

•

10 years ago

No longer blocks: 956284

Depends on: 956284

Whiteboard: [most likely fixed by bug 956284]

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 37

•

10 years ago

Steven, could you please test the upcoming Nightly build from today to see if you can still reproduce this problem? If not, bug 956284 really fixed it.

Flags: needinfo?(smacleod)

Steven MacLeod [:smacleod]

Comment 38

•

10 years ago

(In reply to Henrik Skupin (:whimboo) from comment #37)
> Steven, could you please test the upcoming Nightly build from today if you
> can still reproduce this problem? If not bug 956284 really fixed it.

I'm unable to reproduce the cycle collector crash now.

I still get lockups occasionally in this situation, but I think that's just related to Bug 964531

Flags: needinfo?(smacleod)

Tim Taubert [:ttaubert] (inactive)

Comment 39

•

10 years ago

(In reply to Steven MacLeod [:smacleod] from comment #38)
> I still get lockups occasionally in this situation, but I think that's just
> related to Bug 964531

Investigated the lock ups in bug 965309.

Andrew McCreight [:mccr8]

Updated

•

10 years ago

Status: NEW → RESOLVED

Closed: 10 years ago

No longer depends on: 956284

Resolution: --- → DUPLICATE

Andrew McCreight [:mccr8]

Comment 41

•

10 years ago

Thanks again for the test case, Steven!

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 42

•

10 years ago

Andrew, is the testcase something we could add to one of our test suites?

Flags: in-testsuite?

Tim Taubert [:ttaubert] (inactive)

Comment 43

•

10 years ago

Bug 959562 suggests you have a working test case already ;) But seriously it's a rather intermittent issue caused by spawning workers on shutdown. Might be hard/impossible to reproduce reliably in a test.

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 44

•

10 years ago

We haven't created a minimized testcase for bug 959562. It also failed intermittently. So I was hoping the testcase here on this bug would make it easier to reproduce reliably.

Andrew McCreight [:mccr8]

Comment 45

•

10 years ago

Yeah, I don't know if there's a test.  Presumably once Steven's stuff lands, things will fail if we regress this, so that's something.

Lukas Blakk [:lsblakk] use ?needinfo

Updated

•

10 years ago

tracking-firefox28: + → ---

tracking-firefox29: + → ---

BMO Automation

Updated

•

9 years ago

Group: core-security → core-security-release

Daniel Veditz [:dveditz]

Updated

•

8 years ago

Group: core-security-release

Attachments:

- Patch - Used for STR (Steven MacLeod [:smacleod], 10 years ago, 5.13 KB, patch)
- sessionstore.js for STR (Steven MacLeod [:smacleod], 10 years ago, 1.22 KB, application/x-javascript)