Closed Bug 1262015 Opened 8 years ago Closed 7 years ago

Intermittent browser_wa_reset-01.js | Found a tab after previous test timed out: doc_simple-context.html - | application crashed [@ js::gc::ZoneCellIterImpl::ZoneCellIterImpl(JS::Zone *,js::gc::AllocKind)]

Categories

(Core :: JavaScript: GC, defect, P3)

Tracking

RESOLVED INCOMPLETE
Tracking Status
e10s + ---
firefox47 --- wontfix
firefox48 --- wontfix
firefox49 --- affected
firefox50 --- affected

People

(Reporter: KWierso, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure)

Component: Developer Tools: Web Audio Editor → JavaScript: GC
Product: Firefox → Core
Paul, I pinged Terrence about this on IRC and he said there's probably an underlying tracing/rooting issue in this code. Any chance you can take a look?
Flags: needinfo?(padenot)
This might just be another shutdown issue. I put a possible fix for all MediaStreamGraph-related issues in bug 1267600.
Flags: needinfo?(padenot)
This appears to still be hitting with high frequency. Any chance you could take another look, Paul?
Flags: needinfo?(padenot)
I can, but I might need some hints to debug this.

Andrew, if this is a tracing/rooting issue, do we have a way to debug this? I suppose I could push some instrumentation on try and retrigger like crazy or something. I'm afraid I know close to nothing about all this.
Flags: needinfo?(padenot) → needinfo?(continuation)
I see at least two different assertions here, isNurseryAllocAllowed
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=112939#L20448

and rt->gc.nursery.isEmpty()
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-aurora&job_id=2861093

The first was more common in the logs I looked at. These failures should be starred against the assertion message, not just against the fact that there was a crash.

I'm not sure what these assertions mean, maybe Terrence could help.

I did notice that these assertions seem to be happening shortly after a "WARNING: Audio Buffer is not full by the end of the callback." message, so maybe that's related.
Flags: needinfo?(continuation) → needinfo?(terrence)
The first assertion means that something tried to allocate a generic object while there was an AutoAssertNoNurseryAlloc on the stack. This happens frequently if someone tries to call a script or use the SpiderMonkey API from a callback where that is not allowed. The same is true of the nursery.isEmpty() assertion. The latter can only happen if script usage occurs in a GC callback.

This will be trivial to track down if we can find a clean crash stack. So far all the ones I've checked have been hopelessly corrupted: e.g. arena_dalloc cannot possibly call js::Interpret. I'll keep looking.
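
For readers unfamiliar with this kind of assertion, here is a minimal sketch of the pattern described above: a scope guard that forbids nursery allocation while it is alive, and an allocation path that checks for it. This is a toy model, not SpiderMonkey's implementation. ToyRuntime, ToyAutoAssertNoNurseryAlloc, allocateInNursery and callbackTripsAssertion are hypothetical names made up for illustration, and the sketch throws an exception where a debug build would hit a fatal assertion.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Toy stand-in for a runtime (hypothetical, not the real JSRuntime).
struct ToyRuntime {
    std::size_t noNurseryAllocDepth = 0;  // number of live guards
    std::size_t nurseryObjects = 0;       // objects currently in the toy nursery

    bool isNurseryAllocAllowed() const { return noNurseryAllocDepth == 0; }
    bool nurseryIsEmpty() const { return nurseryObjects == 0; }

    // Toy allocation path. Throws where a debug build would assert,
    // so that the failure can be observed without aborting.
    void allocateInNursery() {
        if (!isNurseryAllocAllowed())
            throw std::logic_error("isNurseryAllocAllowed");
        ++nurseryObjects;
    }
};

// RAII guard (hypothetical): nursery allocation is forbidden while one is alive.
class ToyAutoAssertNoNurseryAlloc {
  public:
    explicit ToyAutoAssertNoNurseryAlloc(ToyRuntime& rt) : rt_(rt) {
        ++rt_.noNurseryAllocDepth;
    }
    ~ToyAutoAssertNoNurseryAlloc() {
        assert(rt_.noNurseryAllocDepth > 0);
        --rt_.noNurseryAllocDepth;
    }
    ToyAutoAssertNoNurseryAlloc(const ToyAutoAssertNoNurseryAlloc&) = delete;
    ToyAutoAssertNoNurseryAlloc& operator=(const ToyAutoAssertNoNurseryAlloc&) = delete;

  private:
    ToyRuntime& rt_;
};

// Models a callback that wrongly allocates while the guard is on the stack.
// Returns true if the toy "assertion" fired.
bool callbackTripsAssertion(ToyRuntime& rt) {
    ToyAutoAssertNoNurseryAlloc guard(rt);
    try {
        rt.allocateInNursery();  // not allowed in this scope
    } catch (const std::logic_error&) {
        return true;
    }
    return false;
}
```

In this model, calling callbackTripsAssertion on a fresh ToyRuntime reports the failure, and allocation is allowed again once the guard has gone out of scope. A clean crash stack matters for the same reason: it would show which caller sat between the guard and the allocation.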
Flags: needinfo?(terrence)
Looking at tens of stacks from the most recent orangefactor report shows that these are:
  * Only on M-e10s(dt7)
  * Only in debug builds
  * On all versions of Windows (although mostly win7)
  * All have a very similar, but essentially broken stack trace

I think this is most likely either a miscompilation or some sort of really, really nasty heap corruption. Looking at crashstats for tryNewNurseryGCThing, I see [1]. So this may be an issue we've shipped in a release. Unfortunately, the stacks on those reports are even more broken.

I'm afraid that this bug is going to require a dmajor level of debugging skill to investigate successfully.

1- https://crash-stats.mozilla.org/signature/?signature=js%3A%3Agc%3A%3AGCRuntime%3A%3AtryNewNurseryObject%3CT%3E&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_sort=-date&page=1#reports
(In reply to Terrence Cole [:terrence] from comment #14)
>   * On all versions of windows (although mostly win7)

As mentioned on IRC, this is probably because WinXP/Win8 only have M-e10s enabled on Ash and on the release branches where volume is obviously much lower. WinXP and Win8 are run on in-house machines still, so that at least makes it seem unlikely to be an issue with AWS machine configs or something.
Sounds like the kind of situation the Uptime team might be interested in too.
See Also: → 1237795, 1240231
Terrence, FWIW, bug 1240231 was hitting on OSX too, so I'm not sure this is a compiler issue unless it's something that manages to affect multiple different ones. But I'm also wondering if it's worth throwing rr-chaos at it at this point to see if we can hit it on Linux too under the right circumstances.
I tracked bug 1237795 down to bug 1132501. Hopefully that helps shed some light on this.
Priority: -- → P3
Something made this stop on trunk around July 20. I wonder what!
Flags: needinfo?(terrence)
Nothing stands out. It probably wouldn't though if it was a heap corruption or undefined behavior.
Flags: needinfo?(terrence)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE