Closed Bug 1304519 Opened 3 years ago Closed 3 years ago

Crash in jemalloc_crash | arena_ralloc | moz_xrealloc | nsTArray_base<T>::EnsureCapacity<T> | nsTArray_Impl<T>::AppendElement<T> | `anonymous namespace''::internal_Accumulate

Categories

(Toolkit :: Telemetry, defect, P1, critical)


Tracking


RESOLVED FIXED
mozilla52
Tracking Status
firefox51 --- unaffected
firefox52 + fixed

People

(Reporter: jchen, Assigned: chutten)

References

Details

(Keywords: crash, regression, Whiteboard: [measurement:client])

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-a79d5746-58ca-454c-8b19-908362160921.
=============================================================

Infrequent crashes that started with the 9-20 Nightly (currently the #26 and #30 top crashes on Windows for that Nightly). Based on the stack, I think it's a regression from bug 1218576.
Flags: needinfo?(chutten)
This is a content process crash; presumably crashing when we are appending to the gAccumulations buffer.
I can't quite tell from the stack whether this is an OOM crash or a different issue.

> 0 	mozglue.dll 	jemalloc_crash 	memory/mozjemalloc/jemalloc.c:1631
> 1 	mozglue.dll 	arena_ralloc 	memory/mozjemalloc/jemalloc.c:4929
> 2 	mozglue.dll 	moz_xrealloc 	memory/mozalloc/mozalloc.cpp:105
> 3 	xul.dll 	nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_CopyWithMemutils>::EnsureCapacity<nsTArrayInfallibleAllocator>(unsigned __int64, unsigned __int64) 	obj-firefox/dist/include/nsTArray-inl.h:183
> 4 	xul.dll 	nsTArray_Impl<nsString::Segment, nsTArrayInfallibleAllocator>::AppendElement<nsString::Segment, nsTArrayInfallibleAllocator>(nsString::Segment&&) 	obj-firefox/dist/include/nsTArray.h:2073
> 5 	xul.dll 	`anonymous namespace'::internal_Accumulate 	toolkit/components/telemetry/TelemetryHistogram.cpp:1357
Priority: -- → P2
Whiteboard: [measurement:client]
Is it possible that my use of nsTArray is flawed in some odd way? The refresh driver [1] does things the same way, at least as far as I can tell. +froydnj: Could you take a second look at the AppendElements in RemoteAccumulate and SwapElements in IPCTimerFired (in TelemetryHistogram.cpp) to make sure I'm not footgunning in some semi-obvious way?

I really have no idea what could be going on here unless the main thread is too crushed to fire the IPC timer... how many accumulations would there have to be in order to run out of memory? They're only two words each.

[1]: https://dxr.mozilla.org/mozilla-central/rev/560b2c805bf7bebeb3ceebc495a81b2aa4c0c755/layout/base/nsRefreshDriver.cpp#2200
Flags: needinfo?(chutten) → needinfo?(nfroyd)
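For readers unfamiliar with the pattern under discussion, here is a minimal sketch of the append-under-lock / swap-on-timer batching scheme, using std::mutex and std::vector as stand-ins for Mozilla's StaticMutexAutoLock and nsTArray (the names and the two-word Accumulation layout are simplified assumptions, not the real TelemetryHistogram code):

```cpp
// Sketch of the batching pattern: child-process threads append samples
// under a mutex; a main-thread timer swaps the whole buffer out and
// sends it over IPC without holding the lock.
#include <cassert>
#include <cstdint>
#include <mutex>
#include <utility>
#include <vector>

struct Accumulation {
  uint32_t id;      // histogram ID (simplified)
  uint32_t sample;  // the "only two words" mentioned above
};

std::mutex gAccumulationsMutex;
std::vector<Accumulation> gAccumulations;

// Called from any thread: append one sample under the lock.
void RemoteAccumulate(uint32_t id, uint32_t sample) {
  std::lock_guard<std::mutex> lock(gAccumulationsMutex);
  gAccumulations.push_back({id, sample});
}

// Called from the timer thread: swap the buffer out under the lock,
// leaving gAccumulations empty; the caller sends the batch lock-free.
std::vector<Accumulation> TakeBatch() {
  std::vector<Accumulation> toSend;
  {
    std::lock_guard<std::mutex> lock(gAccumulationsMutex);
    toSend.swap(gAccumulations);
  }
  return toSend;
}
```

The swap keeps the critical section tiny: the expensive IPC send happens on the local copy, outside the lock.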
OOM seems really unlikely, at least for the crash linked in comment 0; we're running on a 64-bit machine and there's 8TB of virtual memory free.  The jemalloc stack also indicates that we're reallocating a block of memory (ralloc) and the actual code indicates that we've successfully allocated a block and we're crashing in the process of deallocating the old block.

I think that means memory corruption of some sort, though how that's happening in this instance, I'm not sure.
Flags: needinfo?(nfroyd)
We're seeing what could very well be memory corruption in bug 1304009 as well. Is there anything that springs to mind that could be a potential cause?
Duplicate of this bug: 1306544
Crash Signature: nsTArray_base<T>::EnsureCapacity<T> | nsTArray_Impl<T>::AppendElement<T> | `anonymous namespace''::internal_RemoteAccumulate] → nsTArray_base<T>::EnsureCapacity<T> | nsTArray_Impl<T>::AppendElement<T> | `anonymous namespace''::internal_RemoteAccumulate] [@ jemalloc_crash | je_realloc | moz_xrealloc | nsTArray_base<T>::EnsureCapacity<T> | nsTArray_Impl<T>::AppendElement<T> | inte…
(In reply to Chris H-C :chutten from comment #4)
> We're seeing what could very well be memory corruption in bug 1304009 as
> well. Is there anything that springs to mind that could be a potential cause?

Have all the telemetry races been cleaned up? Multiple threads accessing the same array could be bad. I suggest this because all 4 of the crashes with the signature in bug 1306544 are calling into telemetry from off the main thread.
Hrm. Several of the reports are coming in over JSHistogram_Add, which is specifically exempt from the mutex [1].

But there are reports coming in that aren't on that path... maybe we just are seeing the fallout instead of the fault.

Hrm. 

ref. bug 1258183 ni? jseward for if these weird things could be easily walked around (say, with a gAccumulationsMutex?)

[1]: https://dxr.mozilla.org/mozilla-central/rev/9baec74b3db1bf005c66ae2f50bafbdb02c3be38/toolkit/components/telemetry/TelemetryHistogram.cpp#1454
Flags: needinfo?(jseward)
(In reply to Chris H-C :chutten from comment #7)
> Hrm. Several of the reports are coming in over JSHistogram_Add, which is
> specifically exempt from the mutex[1].
> 
> But there are reports coming in that aren't on that path... maybe we just
> are seeing the fallout instead of the fault.

JSHistogram_Add should always be called on the main thread.  But it's certainly possible that it's accumulating into a histogram that's modified on multiple threads.  The accumulation itself should be guarded, I think, even if all of the JS stuff doesn't need to be.  (One more argument for MutexAutoLock& parameters rather than function naming conventions...)
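The "MutexAutoLock& parameters rather than function naming conventions" idea can be sketched as below, with std::lock_guard standing in for MutexAutoLock; the function names are illustrative, not the real TelemetryHistogram API:

```cpp
// Sketch of the lock-proof-parameter idiom: internal_* functions take a
// reference to a live lock guard, so a caller cannot reach them without
// already holding the mutex. A naming convention alone can't enforce this.
#include <cassert>
#include <mutex>

static std::mutex gTelemetryMutex;
static int gSampleCount = 0;

// The guard parameter is never used; it exists purely so that callers
// must have constructed a lock_guard (i.e. taken the lock) to call this.
static void internal_Accumulate(const std::lock_guard<std::mutex>& /*proof*/,
                                int sample) {
  gSampleCount += sample;  // safe: the caller proved the lock is held
}

int Accumulate(int sample) {
  std::lock_guard<std::mutex> lock(gTelemetryMutex);
  internal_Accumulate(lock, sample);  // compiles only because we hold it
  return gSampleCount;
}
```

Calling internal_Accumulate without a guard in scope simply fails to compile, turning a runtime race into a build-time error.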
FWIW, we can have multiple DOM workers, so we could be accumulating worker telemetry from multiple threads.
Duplicate of this bug: 1307031
I guess the least invasive approach for the moment would be to lock access and modification to the g(Keyed)Accumulations arrays. The patch for review has passed a short try run [1] and is a simple, straightforward patch.

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=846471eec293
(In reply to Chris H-C :chutten from comment #12)
> [1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=846471eec293

My main concern here is the possibility of introducing new deadlocks.
Avoiding them was a major source of difficulty in bug 1258183, and that
is why bug 1258183 does not fully de-race Telemetry.

With your proposed patch in comment #12, is there any possible call
path from |internal_RemoteAccumulate(ID, uint32_t)| back to itself,
either on the same or a different thread?  And the same question for
|internal_RemoteAccumulate(ID, const nsCString&, uint32_t)|?

I would also feel more comfortable if you could try this with TSan, so
as to validate the theory that these crashes are caused by new races,
and to validate the proposed fix.

 void
 TelemetryHistogram::IPCTimerFired(nsITimer* aTimer, void* aClosure)
 {
   MOZ_ASSERT(NS_IsMainThread());
   nsTArray<Accumulation> accumulationsToSend;
   nsTArray<KeyedAccumulation> keyedAccumulationsToSend;
   {
-    StaticMutexAutoLock locker(gTelemetryHistogramMutex);
+    StaticMutexAutoLock locker(gAccumulationsMutex);
     if (gAccumulations) {
       accumulationsToSend.SwapElements(*gAccumulations);
     }
+    StaticMutexAutoLock keyLocker(gKeyedAccumulationsMutex);
     if (gKeyedAccumulations) {
       keyedAccumulationsToSend.SwapElements(*gKeyedAccumulations);
     }
   }

One effect of this is that |gAccumulations| and |gKeyedAccumulations| are
no longer updated as an atomic unit.  Does that matter?
Flags: needinfo?(jseward)
(In reply to Julian Seward [:jseward] from comment #13)
> (In reply to Chris H-C :chutten from comment #12)
> > [1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=846471eec293
> 
> My main concern here is the possibility of introducing new deadlocks.
> Avoiding them was a major source of difficulty in bug 1258183,and that
> is why bug 1258183 does not fully de-race Telemetry.
> 
> With your proposed patch in comment #12, is there any possible call
> path from |internal_RemoteAccumulate(ID, uint32_t)| back to itself,
> either on the same or a different thread?  And the same question for
> |internal_RemoteAccumulate(ID, const nsCString&, uint32_t)?

internal_RemoteAccumulate will not call back into itself so long as armIPCTimer continues not to accumulate its own telemetry.

> I would also feel more comfortable if you could try this with TSan, so
> as to validate the theory that these crashes are caused by new races,
> and to validate the proposed fix.

"try this with TSan"... is that as simple as following the instructions [1]?

> <snip>
> One effect of this is that |gAccumulations| and |gKeyedAccumulations| are
> no longer updated as an atomic unit.  Does that matter?

Nope. gAccumulations and gKeyedAccumulations are independent except where they just happen to share the same batching timer, timeout, and size limit. Allowing them to be simultaneously accumulated to is an intended benefit.

[1]: https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Thread_Sanitizer
Flags: needinfo?(jseward)
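As a rough guide to the TSan instructions linked above, a ThreadSanitizer build is driven by mozconfig options along these lines (the exact set is an assumption from the general TSan build docs; consult the linked page for the current requirements):

```
# Hypothetical mozconfig fragment for a TSan build; in particular,
# jemalloc is typically disabled because TSan needs its own allocator.
ac_add_options --enable-thread-sanitizer
ac_add_options --disable-jemalloc
ac_add_options --enable-debug
```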
Comment on attachment 8797216 [details]
bug 1304519 - Ensure we hold a lock before accumulating.

https://reviewboard.mozilla.org/r/82820/#review81752

Why are we adding more locks rather than locking around the JS histogram calls to `internal_Accumulate*` functions?  The whole point of `internal_*` functions is that they already hold the lock, and we're violating that invariant from the JS histograms.  Maybe we're violating it from other places, too...

I think we need better locking discipline, not more locks.
Attachment #8797216 - Flags: review?(nfroyd)
Apparently we have a bug for precisely that.
Depends on: 1278821
Flags: needinfo?(jseward)
No longer depends on: 1278821
Depends on: 1278821
Assignee: nobody → chutten
[Tracking Requested - why for this release]:
This is a suspected race leading to a crash or other artifacts.
It is a likely regression from bug 1218576.
Priority: P2 → P1
(In reply to Chris H-C :chutten from comment #14)
> "try this with TSan"... is that as simple as following the instructions [1]?
> [1]:
> https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Thread_Sanitizer

Yes, those look like they would work.  I can also test patches if that helps.
Julian, I think bug 1278821 (JSHistograms_* & JSKeyedHistograms_*) was the only racy code path left for TelemetryHistogram.cpp.
Do you agree?
Flags: needinfo?(jseward)
I've put a patch up for review which doesn't necessarily fix _all_ of the racy stuff (which is why I'm posting it here instead of against bug 1278821), but it _does_ quiet TSan's complaints of a data race on the very nsTArray we're worried about.

My testing was running dom/media/tests/mochitest/test_peerConnection_* tee'd to an output file which I pawed through for data races.
Comment on attachment 8797216 [details]
bug 1304519 - Ensure we hold a lock before accumulating.

https://reviewboard.mozilla.org/r/82820/#review82196

::: toolkit/components/telemetry/TelemetryHistogram.cpp:1512
(Diff revision 2)
>    Histogram::ClassType type = h->histogram_type();
>  
>    JS::CallArgs args = CallArgsFromVp(argc, vp);
>  
>    if (!internal_CanRecordBase()) {
>      return true;

I agree that we can just start with `JS(Keyed)Histogram_Add()` here, but we should at least protect the other `internal_*` calls and the `gHistograms` access in this function while we're here.

::: toolkit/components/telemetry/TelemetryHistogram.cpp:1785
(Diff revision 2)
>    if (!args[0].isString() || !key.init(cx, args[0])) {
>      JS_ReportErrorASCII(cx, "Not a string");
>      return false;
>    }
>  
>    const uint32_t type = keyed->GetHistogramType();

Ditto here for the `keyed` access.
Attachment #8797216 - Flags: review?(gfritzsche)
Comment on attachment 8797216 [details]
bug 1304519 - Ensure we hold a lock before accumulating.

Catching up on this per IRC, this is ok for me as a start.
This basically restores the raciness status quo from before bug 1218576, and is easy & quick to land to isolate the fallout here.

Bug 1278821 can handle the rest.
Attachment #8797216 - Flags: review+
Comment on attachment 8797216 [details]
bug 1304519 - Ensure we hold a lock before accumulating.

https://reviewboard.mozilla.org/r/82820/#review82208
Attachment #8797216 - Flags: review+
Pushed by chutten@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ee7990c78dd9
Ensure we hold a lock before accumulating. r=gfritzsche
Tracking 52+ for this crash.
https://hg.mozilla.org/mozilla-central/rev/ee7990c78dd9
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Duplicate of this bug: 1304009
Duplicate of this bug: 1292429
Duplicate of this bug: 1304580
Duplicate of this bug: 1304935
Duplicate of this bug: 1304940
Duplicate of this bug: 1306803
Duplicate of this bug: 1307469
Version: unspecified → Trunk
Flags: needinfo?(jseward)