Closed Bug 1132502 Opened 9 years ago Closed 8 years ago

crash in js::detail::HashTable<T>::lookupForAdd

Categories

(Core :: JavaScript Engine, defect, P1)

42 Branch
x86
Windows NT
defect

Tracking


RESOLVED FIXED
Tracking Status
firefox36 --- affected
firefox37 --- affected
firefox41 --- affected
firefox42 --- affected
firefox44 --- affected
firefox45 --- affected
firefox46 --- affected
firefox47 --- fixed
firefox48 --- fixed

People

(Reporter: alex_mayorga, Assigned: n.nethercote)

References

Details

(Keywords: crash, Whiteboard: [caf priority: p2][CR 1004423])

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is 
report bp-67343b27-8079-489a-a04f-6c2972150212.
=============================================================
Component: General → JavaScript Engine
Product: Firefox → Core
See Also: → 1180954
Crash Signature: [@ js::detail::HashTable<js::HashMapEntry<char const*, JS::ClassInfo>, js::HashMap<char const*, JS::ClassInfo, js::CStringHashPolicy, js::SystemAllocPolicy>::MapHashPolicy, js::SystemAllocPolicy>::lookupForAdd(char const* const&)] → [@ js::detail::HashTable<js::HashMapEntry<char const*, JS::ClassInfo>, js::HashMap<char const*, JS::ClassInfo, js::CStringHashPolicy, js::SystemAllocPolicy>::MapHashPolicy, js::SystemAllocPolicy>::lookupForAdd(char const* const&)] [@ js::detail::HashTabl…
Priority: -- → P1
Version: 36 Branch → 42 Branch
¡Hola!

There were 743 crashes in the past week per https://crash-stats.mozilla.com/report/list?product=Firefox&signature=js%3A%3Adetail%3A%3AHashTable%3CT%3E%3A%3AlookupForAdd

bp-f3a3a531-609f-4933-bf1e-64b6f2160216
	16/02/2016	03:52 p.m.

Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	xul.dll 	js::detail::HashTable<js::HashMapEntry<char const*, JS::ClassInfo>, js::HashMap<char const*, JS::ClassInfo, js::CStringHashPolicy, js::SystemAllocPolicy>::MapHashPolicy, js::SystemAllocPolicy>::lookupForAdd(char const* const&) 	js/public/HashTable.h
1 	xul.dll 	AddClassInfo 	js/src/vm/MemoryMetrics.cpp
2 	xul.dll 	StatsCellCallback<0> 	js/src/vm/MemoryMetrics.cpp
3 	xul.dll 	IterateCompartmentsArenasCells 	js/src/gc/Iteration.cpp
4 	xul.dll 	js::IterateZonesCompartmentsArenasCells(JSRuntime*, void*, void (*)(JSRuntime*, void*, JS::Zone*), void (*)(JSRuntime*, void*, JSCompartment*), void (*)(JSRuntime*, void*, js::gc::Arena*, JS::TraceKind, unsigned int), void (*)(JSRuntime*, void*, void*, JS::TraceKind, unsigned int)) 	js/src/gc/Iteration.cpp
5 	xul.dll 	StatsCompartmentCallback 	js/src/vm/MemoryMetrics.cpp
6 	xul.dll 	nsTHashtable<nsISupportsHashKey>::nsTHashtable<nsISupportsHashKey>() 	xpcom/glue/nsTHashtable.h
7 	xul.dll 	xpc::JSReporter::CollectReports(nsDataHashtable<nsUint64HashKey, nsCString>*, nsDataHashtable<nsUint64HashKey, nsCString>*, nsIMemoryReporterCallback*, nsISupports*, bool) 	js/xpconnect/src/XPCJSRuntime.cpp
8 	xul.dll 	nsWindowMemoryReporter::CollectReports(nsIMemoryReporterCallback*, nsISupports*, bool) 	dom/base/nsWindowMemoryReporter.cpp
When I look at the Mozilla Crash Reporter (Other Reports / Reports), about 75% of those crashes happen at address 0xffffffffe5e5e5e5.

Doesn't that information help in pinpointing the faulty code location?
Flags: needinfo?(nihsanullah)
Terrence this looks similar to 

https://bugzilla.mozilla.org/show_bug.cgi?id=1189934 and maybe https://bugzilla.mozilla.org/show_bug.cgi?id=1180954
Flags: needinfo?(nihsanullah) → needinfo?(terrence)
Any sort of heap corruption appears likely to hit this signature, so it is probably unrelated if there is a new spike.
If we expand to the 28-day table view, we can see that the 2016031515 build has 2521 crashes, whereas all the other builds hold steady at anywhere from 1-4 crashes up to a few hundred. I think this massive spike is almost certainly an isolated incident that has already been fixed. The smaller spikes are probably in the same category, for less noticeable bugs.

Clearing the ni? as there's nothing else I can say here without STR.
Flags: needinfo?(terrence)
One of my desktops gets this crash about once per day. Is there something I can do to help diagnose the issue?
(In reply to bugzilla.m7 from comment #8)
> One of my desktops gets this crash about once per day. Is there something I
> can do to help diagnose the issue?

Yes, there is. We suspect that a large portion of these random-looking crashes are caused by the roughly 8%[1] RAM failure rate. If you could run memtest86 overnight, that might give us some data on whether that is the likely cause. Thanks!

1- http://www.zdnet.com/article/dram-error-rates-nightmare-on-dimm-street/
Hmm, don't you think that I would get a BSOD if I had faulty memory? And why would Firefox be the only app to crash? And doesn't the fact that 75% of crashes are at address 0xffffffffe5e5e5e5 mean that this is not a random crash?
(In reply to bugzilla.m7 from comment #10)
> Hmm, don't you think that I would get a BSOD if I had faulty memory?

Not necessarily. We know of a number of cases where people had crashes in GC code with varying signatures and faulty RAM was actually the issue. That said, if you always get this exact signature, I'd think that less likely.

> And doesn't the fact that 75% of crashes are at address 0xffffffffe5e5e5e5 mean that this is not a random crash?

That actually makes me wonder more. The leading ffff... may be bogus, but the e5e5... pattern points to memory that we have previously freed, and that should not happen this consistently. There could still be some interaction with a graphics driver or other third-party (e.g. security) software running in our process that plays a role here. Is there anything you know of that makes this machine different from others?
> Is there anything you know of that makes this machine different from others?

Well, I checked the configurations here:
https://crash-stats.mozilla.com/report/list?signature=js%3A%3Adetail%3A%3AHashTable%3CT%3E%3A%3AlookupForAdd&range_value=7&range_unit=days&date=2016-04-14#tab-reports

and it seems to happen on all sorts of OSes and Firefox versions...

The only pattern I could find was this strange address 0xffffffffe5e5e5e5, common to 70% of all reports (not only mine); 13 of 19 pages of reports during the last 28 days have this address.

FYI, the anti-virus on the offending PC is the free version of Avast.

PS: Is there a way to get a CSV of the report data? I could run some analysis on it and check for other correlations.
> 0xffffffffe5e5e5e5

Looks like the jemalloc freed-memory poison value (the 0xe5 fill pattern) that has been sign-extended.
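
For reference, a minimal standalone sketch (illustrative only, not from any crash dump) of how the 32-bit 0xe5e5e5e5 poison pattern sign-extends to the 0xffffffffe5e5e5e5 address seen in these reports:

#include <cstdint>
#include <cstdio>

int main() {
    // jemalloc fills freed memory with 0xe5 bytes, so a 32-bit pointer read
    // from freed memory comes back as 0xe5e5e5e5.
    int32_t poisoned = int32_t(0xe5e5e5e5);   // negative as a signed value
    // Widening the signed value to 64 bits sign-extends it, producing the
    // 0xffffffffe5e5e5e5 address reported by the crash reporter.
    int64_t widened = poisoned;
    printf("0x%016llx\n", (unsigned long long)widened);  // 0xffffffffe5e5e5e5
    return 0;
}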
I can take a quick look at some crash dumps.
Flags: needinfo?(jdemooij)
(In reply to bugzilla.m7 from comment #10)
> Hmm, don't you think that I would get a BSOD if I had faulty memory? And why
> would Firefox be the only app to crash? And doesn't the fact that 75% of
> crashes are at address 0xffffffffe5e5e5e5 mean that this is not a random
> crash?

Sorry, I think I got unlucky and happened to only see crashes without the jemalloc free pattern.

That said, it's entirely possible to not get a BSOD with bad RAM. Most core OS software is written in C/C++ that is comparatively sparse in pointers, has a fairly fixed static footprint, and doesn't walk the heap continuously. By comparison, Firefox is huge (and hugely dynamic), mostly pointers, and continuously touching all of them.
I looked at 3 different crash reports (I picked random FF 45.0.1 reports with high values for uptime and last crash).

We're crashing in StatsCellCallback because we have a BaseShape with clasp_->name == 0xe5e5e5e5. That's a bit weird - if we had a bogus BaseShape, I'd expect us to crash accessing clasp_ instead of clasp_->name.

Most Classes have a statically-allocated name, but we do have this weird XPCNativeScriptableShared thing that manufactures Classes on the fly. Maybe that's causing problems somehow - shouldn't be too hard to add instrumentation to confirm that.
njn, see comment 16. You did some JSClass work recently and I vaguely remember you running into crashy BaseShape stuff in this code.
After reading more XPConnect code I'm pretty sure this is indeed an XPCNativeScriptableShared Class. I looked at ~30 crash reports and they all have (non-standard) add-ons.

The question is what's the best way to fix it. We could set a flag on the BaseShape if clasp_->isWrappedNative(); then we'd know it's unsafe to access clasp_, because it might be dynamically allocated.

It's not very elegant and there's a risk of this footgun showing up elsewhere (like ObjectGroup::clasp_), but it gets the job done without adding a lot of complexity.
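
A hypothetical sketch of that flag idea (all names here are invented for illustration and are not the real SpiderMonkey API):

struct Class { const char* name; };

class BaseShape {
    const Class* clasp_;
    uint32_t flags_;

  public:
    static const uint32_t HEAP_ALLOCATED_CLASS = 1 << 0;

    BaseShape(const Class* clasp, bool heapAllocatedClass)
      : clasp_(clasp), flags_(heapAllocatedClass ? HEAP_ALLOCATED_CLASS : 0)
    {}

    // Memory reporting would consult the flag instead of blindly reading
    // clasp_->name, which may point into freed memory for dynamic Classes.
    const char* classNameForReporting() const {
        return (flags_ & HEAP_ALLOCATED_CLASS) ? "<dynamic class>" : clasp_->name;
    }
};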
Summary: crash in js::detail::HashTable<js::HashMapEntry<char const*, JS::ClassInfo>, js::HashMap<char const*, JS::ClassInfo, js::CStringHashPolicy, js::SystemAllocPolicy>::MapHashPolicy, js::SystemAllocPolicy>::lookupForAdd(char const* const&) → crash in js::detail::HashTable<T>::lookupForAdd
Jan: thank you for the extra info.

I agree that this doesn't look like generic heap corruption or faulty hardware. It's much too regular for that. We also have bug 1243529 which is an intermittent orange that looks like it has the same root cause. I had a look at three minidumps yesterday and found some interesting stuff; I'll now take a look at what Jan found and see how that fits into the puzzle.
Assignee: nobody → n.nethercote
Blocks: 1243529
The code in StatsCellCallback() looks like this:

> case JS::TraceKind::BaseShape: {
>   BaseShape* base = static_cast<BaseShape*>(thing);
>   CompartmentStats* cStats = GetCompartmentStats(base->compartment());
> 
>   JS::ClassInfo info;        // This zeroes all the sizes.
>   info.shapesGCHeapBase += thingSize;
>   // No malloc-heap measurements.
> 
>   cStats->classInfo.add(info);
> 
>   const Class* clasp = base->clasp();
>   const char* className = clasp->name;
>   AddClassInfo(granularity, cStats, className, info);
>   break;
> }

AddClassInfo() looks like this:

> static void
> AddClassInfo(Granularity granularity, CompartmentStats* cStats, const char* className,
>              JS::ClassInfo& info)
> {
>     if (granularity == FineGrained) {
>         if (!className)
>             className = "<no class name>";
>         CompartmentStats::ClassesHashMap::AddPtr p =
>             cStats->allClasses->lookupForAdd(className);      // CRASH!
>         if (!p) {
>             bool ok = cStats->allClasses->add(p, className, info);
>             // Ignore failure -- we just won't record the
>             // object/shape/base-shape as notable.
>             (void)ok;
>         } else {
>             p->value().add(info);
>         }
>     }
> }

In the three minidumps I looked at, in AddClassInfo():
- cStats was nullptr
- className was something bogus (I saw 0x6, 0x8a, 0xe5e5e5e5)

It looks like the crash is actually occurring when trying to hash the
className, which is unsurprising, given that className is bogus. If cStats is
truly null, we probably would have crashed earlier, in the
|cStats->classInfo.add(info);| call, or earlier in the
|cStats->allClasses->lookupForAdd()| sequence. Furthermore, based on how the
memory reporting works, I can't see how cStats could be null.

So I don't really trust the cStats==nullptr claim, though I will change
GetCompartmentStats() to check that. That way, if it really is nullptr, we'll
start getting a new crash signature with a clearer cause.
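
As an aside, a simplified sketch (not the real js::CStringHashPolicy) of why the fault surfaces inside lookupForAdd(): the hash policy reads the C string's bytes before any table memory is touched, so a bogus className like 0x6, 0x8a, or 0xe5e5e5e5 faults during hashing.

#include <stdint.h>

static uint32_t HashCString(const char* s) {
    uint32_t h = 0;
    for (; *s; s++)                  // first read through the bad pointer
        h = h * 33 + uint8_t(*s);    // djb2-style mixing, for illustration
    return h;
}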

Now, onto Jan's theory about XPCNativeScriptableShared. If true, then we must
have some BaseShapes with a clasp that points to a js::Class in
XPCNativeScriptableShared that has been freed, which seems dangerous in general.
Why doesn't this cause problems elsewhere? Perhaps those BaseShapes are
unreachable and so not touched by "normal" execution, but memory reporting does
touch them because it touches every GC cell whether reachable or not. I asked
mrbkap and he thought this was plausible.

So I'll try to write a second patch that adds a flag indicating if a BaseShape
points to a heap-allocated class. I also wonder if this flag should be
consulted when measuring objects and shapes, which are handled similarly.
Flags: needinfo?(jdemooij)
We have inconclusive evidence that compartmentStats is sometimes nullptr during
memory reporting, which would be bad. This patch makes us abort in that case.
It also changes some pointers to references to make the expected non-nullness
clearer.
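
A rough standalone sketch of that shape of change (types simplified; the real patch is against JSCompartment in js/src/jscompartment.h):

#include <cassert>

struct CompartmentStats { /* ... measured sizes ... */ };

struct Compartment {
    CompartmentStats* compartmentStats_ = nullptr;

    // Should only be called during memory reporting, when the stats object
    // is attached. The real patch aborts even in release builds so a null
    // pointer yields its own clear crash signature.
    CompartmentStats& compartmentStats() {
        assert(compartmentStats_);
        return *compartmentStats_;
    }
};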
Attachment #8741591 - Flags: review?(jdemooij)
I looked into adding the flag to BaseShape. But it was a bit painful, and I
wouldn't have any way of testing it. Furthermore, bholley says that
XPCNativeScriptableShared is likely to be removed reasonably soon. So I went
with the simpler option of just skipping this code for all BaseShapes. The
consequences are minor and are described in a comment in the patch.
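
In rough outline (mirroring the StatsCellCallback() snippet quoted in comment 20, simplified here), the skip looks like this:

case JS::TraceKind::BaseShape: {
    BaseShape* base = static_cast<BaseShape*>(thing);
    CompartmentStats& cStats = GetCompartmentStats(base->compartment());

    JS::ClassInfo info;              // zeroes all the sizes
    info.shapesGCHeapBase += thingSize;
    cStats.classInfo.add(info);

    // Deliberately no clasp()/AddClassInfo() here: the class pointer may
    // dangle, so all BaseShapes land in the coarse, class-less bucket.
    break;
}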
Attachment #8741637 - Flags: review?(jdemooij)
Comment on attachment 8741591 [details] [diff] [review]
(part 1) - Abort if compartmentStats is null during memory reporting

Review of attachment 8741591 [details] [diff] [review]:
-----------------------------------------------------------------

FWIW, did Visual Studio say it was nullptr? I don't really trust its "Locals" window when debugging opt builds - when in doubt I usually try to infer stuff from the code and the stack (the return addresses on the stack are useful to figure out where we are).

This is a nice refactoring no matter what, though.

::: js/src/jscompartment.h
@@ +764,5 @@
> +
> +  public:
> +    // This should only be called when it is non-null, i.e. during memory
> +    // reporting.
> +    JS::CompartmentStats& compartmentStats() {

Nit: To avoid unexpected copies, maybe delete CompartmentStats's copy constructor?

CompartmentStats(const CompartmentStats& other) = delete;
Attachment #8741591 - Flags: review?(jdemooij) → review+
Comment on attachment 8741637 [details] [diff] [review]
(part 2) - Don't call AddClassInfo() for BaseShapes

Review of attachment 8741637 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for taking this bug.

The sooner these fake Classes go away the better; it's too easy to assume Class pointers stay valid.

::: js/src/vm/MemoryMetrics.cpp
@@ +534,5 @@
> +        // non-trivial change, and heap-allocated js::Class instances are
> +        // likely to go away soon.
> +        //
> +        // So for now we just skip this code for all BaseShapes. The
> +        // consequence is that means that all BaseShapes will show up in

Nit: s/that means//
Attachment #8741637 - Flags: review?(jdemooij) → review+
(In reply to Nicholas Nethercote [:njn] from comment #20)
> I also wonder if this flag should be
> consulted when measuring objects and shapes, which are handled similarly.

Objects are probably fine, because as long as we have an object with this class, it seems the XPConnect code will keep the Class alive. When the object is dead/swept, I think the iteration code here shouldn't visit it anymore.

It seems we'll need something similar for Shapes, though? I can imagine JIT caches etc. holding onto a Shape even though there are no live objects with that shape...
Flags: needinfo?(n.nethercote)
(In reply to Jan de Mooij [:jandem] from comment #23)
> 
> FWIW, did Visual Studio say it was nullptr?

Yes, for three separate minidumps.


> ::: js/src/jscompartment.h
> @@ +764,5 @@
> > +
> > +  public:
> > +    // This should only be called when it is non-null, i.e. during memory
> > +    // reporting.
> > +    JS::CompartmentStats& compartmentStats() {
> 
> Nit: To avoid unexpected copies, maybe delete CompartmentStat's copy
> constructor?
> 
> CompartmentStats(const CompartmentStats& other) = delete;

compartmentStats() returns a reference. Is there a risk of a copy occurring? I would surely hope not...
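
For what it's worth, the copy risk with a reference return is `auto` deduction on the caller side (comp here is a hypothetical caller, not code from the patch):

JS::CompartmentStats& ref = comp.compartmentStats();  // fine: no copy
auto copy = comp.compartmentStats();                  // auto deduces a value: copies!
// With CompartmentStats(const CompartmentStats&) = delete; the second line
// becomes a compile error instead of a silent copy.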
Flags: needinfo?(n.nethercote)
> It seems we'll need something similar for Shapes though? I can imagine JIT
> caches etc holding onto a shape even though there are no live objects with
> that shape..

I haven't seen any crashes for Shapes. Also, Shapes use a lot more memory than BaseShapes, so losing the per-class reporting would be a shame. So for the moment I will just do this for BaseShapes.
(In reply to Nicholas Nethercote [:njn] from comment #27)
> 
> I haven't seen any crashes for Shapes. Also, Shapes use a lot more memory
> than BaseShapes, so losing the per-class reporting would be a shame. So for
> the moment I will just do this for BaseShapes.

I just saw https://crash-stats.mozilla.com/report/index/e2d28f14-f41d-495c-acd1-be88d2150731 from bug 1189934 and it *is* a Shape crash. Bleh.
https://hg.mozilla.org/integration/mozilla-inbound/rev/223a5febb34f8dfdb326d15e4736c82d51a43811
Bug 1132502 (part 1) - Abort if compartmentStats is null during memory reporting. r=jandem.
Keywords: leave-open
Attachment #8741591 - Flags: checkin+
Whiteboard: [CR 1004423]
Whiteboard: [CR 1004423] → [caf priority: p2][CR 1004423]
Comment on attachment 8741591 [details] [diff] [review]
(part 1) - Abort if compartmentStats is null during memory reporting

Approval Request Comment

[Feature/regressing bug #]: N/A.

[User impact if declined]: Crashes when looking at about:memory.

[Describe test coverage new/current, TreeHerder]: No specific tests. Has been on trunk for a week.

[Risks and why]: These two patches are aimed at addressing a moderately frequent crash. The first patch is very simple, and just adds a null check that will crash on failure, which is mostly for diagnostic purposes. The second patch is slightly more complex, but just disables a particular path in memory reporting that is crashy. (In case it's not clear, I want this uplifted to 47, which may be Beta by the time this request is assessed. I don't want it uplifted to 46.)

[String/UUID change made/needed]: none.
Attachment #8741591 - Flags: approval-mozilla-beta?
Attachment #8741637 - Flags: approval-mozilla-beta?
Comment on attachment 8741591 [details] [diff] [review]
(part 1) - Abort if compartmentStats is null during memory reporting

Crash fix that has stabilized in Nightly for a week, Beta47+
Attachment #8741591 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8741637 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Status: NEW → RESOLVED
Closed: 8 years ago
Keywords: leave-open
Resolution: --- → FIXED