Closed Bug 1304519 Opened 3 years ago Closed 3 years ago

Crash in jemalloc_crash | arena_ralloc | moz_xrealloc | nsTArray_base<T>::EnsureCapacity<T> | nsTArray_Impl<T>::AppendElement<T> | `anonymous namespace''::internal_Accumulate

Categories

(Toolkit :: Telemetry, defect, P1, critical)


Tracking


RESOLVED FIXED
mozilla52
Tracking Status
firefox51 --- unaffected
firefox52 + fixed

People

(Reporter: jchen, Assigned: chutten)

References

Details

(Keywords: crash, regression, Whiteboard: [measurement:client])

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-a79d5746-58ca-454c-8b19-908362160921.
=============================================================

Infrequent crashes that started with the 9-20 Nightly (currently the #26 and #30 top crashes on Windows for that Nightly). Based on the stack, I think it's a regression from bug 1218576.
Flags: needinfo?(chutten)
This is a content process crash; presumably crashing when we are appending to the gAccumulations buffer.
I can't quite tell from the stack whether this is an OOM crash or a different issue.

> 0 	mozglue.dll 	jemalloc_crash 	memory/mozjemalloc/jemalloc.c:1631
> 1 	mozglue.dll 	arena_ralloc 	memory/mozjemalloc/jemalloc.c:4929
> 2 	mozglue.dll 	moz_xrealloc 	memory/mozalloc/mozalloc.cpp:105
> 3 	xul.dll 	nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_CopyWithMemutils>::EnsureCapacity<nsTArrayInfallibleAllocator>(unsigned __int64, unsigned __int64) 	obj-firefox/dist/include/nsTArray-inl.h:183
> 4 	xul.dll 	nsTArray_Impl<nsString::Segment, nsTArrayInfallibleAllocator>::AppendElement<nsString::Segment, nsTArrayInfallibleAllocator>(nsString::Segment&&) 	obj-firefox/dist/include/nsTArray.h:2073
> 5 	xul.dll 	`anonymous namespace'::internal_Accumulate 	toolkit/components/telemetry/TelemetryHistogram.cpp:1357
Priority: -- → P2
Whiteboard: [measurement:client]
Is it possible that my use of nsTArray is flawed in some odd way? The refresh driver [1] does things the same way, at least as far as I can tell. +froydnj: Could you take a second look at the AppendElements in RemoteAccumulate and SwapElements in IPCTimerFired (in TelemetryHistogram.cpp) to make sure I'm not footgunning in some semi-obvious way?

I really have no idea what could be going on here unless the main thread is too crushed to fire the IPC timer... how many accumulations would there have to be in order to run out of memory? They're only two words each.

[1]: https://dxr.mozilla.org/mozilla-central/rev/560b2c805bf7bebeb3ceebc495a81b2aa4c0c755/layout/base/nsRefreshDriver.cpp#2200
Flags: needinfo?(chutten) → needinfo?(nfroyd)
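For readers unfamiliar with the pattern under discussion, here is a minimal sketch of the append-under-lock / swap-on-timer batching scheme, using std::mutex and std::vector as stand-ins for Mozilla's StaticMutexAutoLock and nsTArray (the names and the two-word Accumulation layout are simplified assumptions, not the real TelemetryHistogram code):

```cpp
// Sketch of the batching pattern: child-process threads append samples
// under a mutex; a main-thread timer swaps the whole buffer out and
// sends it over IPC without holding the lock.
#include <cassert>
#include <cstdint>
#include <mutex>
#include <utility>
#include <vector>

struct Accumulation {
  uint32_t id;      // histogram ID (simplified)
  uint32_t sample;  // the "only two words" mentioned above
};

std::mutex gAccumulationsMutex;
std::vector<Accumulation> gAccumulations;

// Called from any thread: append one sample under the lock.
void RemoteAccumulate(uint32_t id, uint32_t sample) {
  std::lock_guard<std::mutex> lock(gAccumulationsMutex);
  gAccumulations.push_back({id, sample});
}

// Called from the timer thread: swap the buffer out under the lock,
// leaving gAccumulations empty; the caller sends the batch lock-free.
std::vector<Accumulation> TakeBatch() {
  std::vector<Accumulation> toSend;
  {
    std::lock_guard<std::mutex> lock(gAccumulationsMutex);
    toSend.swap(gAccumulations);
  }
  return toSend;
}
```

The swap keeps the critical section tiny: the expensive IPC send happens on the local copy, outside the lock.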
OOM seems really unlikely, at least for the crash linked in comment 0; we're running on a 64-bit machine and there's 8TB of virtual memory free.  The jemalloc stack also indicates that we're reallocating a block of memory (ralloc) and the actual code indicates that we've successfully allocated a block and we're crashing in the process of deallocating the old block.

I think that means memory corruption of some sort, though how that's happening in this instance, I'm not sure.
Flags: needinfo?(nfroyd)
We're seeing what could very well be memory corruption in bug 1304009 as well. Is there anything that springs to mind that could be a potential cause?
Duplicate of this bug: 1306544
Crash Signature: nsTArray_base<T>::EnsureCapacity<T> | nsTArray_Impl<T>::AppendElement<T> | `anonymous namespace''::internal_RemoteAccumulate] → nsTArray_base<T>::EnsureCapacity<T> | nsTArray_Impl<T>::AppendElement<T> | `anonymous namespace''::internal_RemoteAccumulate] [@ jemalloc_crash | je_realloc | moz_xrealloc | nsTArray_base<T>::EnsureCapacity<T> | nsTArray_Impl<T>::AppendElement<T> | inte…
(In reply to Chris H-C :chutten from comment #4)
> We're seeing what could very well be memory corruption in bug 1304009 as
> well. Is there anything that springs to mind that could be a potential cause?

Have all the telemetry races been cleaned up? Multiple threads accessing the same array could be bad. I suggest this because all 4 of the crashes with the signature in bug 1306544 are calling into telemetry from off the main thread.
Hrm. Several of the reports are coming in over JSHistogram_Add, which is specifically exempt from the mutex [1].

But there are reports coming in that aren't on that path... maybe we just are seeing the fallout instead of the fault.

Hrm. 

ref. bug 1258183 ni? jseward for if these weird things could be easily walked around (say, with a gAccumulationsMutex?)

[1]: https://dxr.mozilla.org/mozilla-central/rev/9baec74b3db1bf005c66ae2f50bafbdb02c3be38/toolkit/components/telemetry/TelemetryHistogram.cpp#1454
Flags: needinfo?(jseward)
(In reply to Chris H-C :chutten from comment #7)
> Hrm. Several of the reports are coming in over JSHistogram_Add, which is
> specifically exempt from the mutex[1].
> 
> But there are reports coming in that aren't on that path... maybe we just
> are seeing the fallout instead of the fault.

JSHistogram_Add should always be called on the main thread.  But it's certainly possible that it's accumulating into a histogram that's modified on multiple threads.  The accumulation itself should be guarded, I think, even if all of the JS stuff doesn't need to be.  (One more argument for MutexAutoLock& parameters rather than function naming conventions...)
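The "MutexAutoLock& parameters rather than function naming conventions" idea can be sketched as below, with std::lock_guard standing in for MutexAutoLock; the function names are illustrative, not the real TelemetryHistogram API:

```cpp
// Sketch of the lock-proof-parameter idiom: internal_* functions take a
// reference to a live lock guard, so a caller cannot reach them without
// already holding the mutex. A naming convention alone can't enforce this.
#include <cassert>
#include <mutex>

static std::mutex gTelemetryMutex;
static int gSampleCount = 0;

// The guard parameter is never used; it exists purely so that callers
// must have constructed a lock_guard (i.e. taken the lock) to call this.
static void internal_Accumulate(const std::lock_guard<std::mutex>& /*proof*/,
                                int sample) {
  gSampleCount += sample;  // safe: the caller proved the lock is held
}

int Accumulate(int sample) {
  std::lock_guard<std::mutex> lock(gTelemetryMutex);
  internal_Accumulate(lock, sample);  // compiles only because we hold it
  return gSampleCount;
}
```

Calling internal_Accumulate without a guard in scope simply fails to compile, turning a runtime race into a build-time error.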
FWIW, we can have multiple DOM workers, so we could be accumulating worker telemetry from multiple threads.
Duplicate of this bug: 1307031
I guess the least invasive approach for the moment would be to lock access and modification to the g(Keyed)Accumulations arrays. The patch for review has passed a short try run [1] and is a simple, straightforward patch.

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=846471eec293
(In reply to Chris H-C :chutten from comment #12)
> [1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=846471eec293

My main concern here is the possibility of introducing new deadlocks.
Avoiding them was a major source of difficulty in bug 1258183, and that
is why bug 1258183 does not fully de-race Telemetry.

With your proposed patch in comment #12, is there any possible call
path from |internal_RemoteAccumulate(ID, uint32_t)| back to itself,
either on the same or a different thread?  And the same question for
|internal_RemoteAccumulate(ID, const nsCString&, uint32_t)|?

I would also feel more comfortable if you could try this with TSan, so
as to validate the theory that these crashes are caused by new races,
and to validate the proposed fix.

 void
 TelemetryHistogram::IPCTimerFired(nsITimer* aTimer, void* aClosure)
 {
   MOZ_ASSERT(NS_IsMainThread());
   nsTArray<Accumulation> accumulationsToSend;
   nsTArray<KeyedAccumulation> keyedAccumulationsToSend;
   {
-    StaticMutexAutoLock locker(gTelemetryHistogramMutex);
+    StaticMutexAutoLock locker(gAccumulationsMutex);
     if (gAccumulations) {
       accumulationsToSend.SwapElements(*gAccumulations);
     }
+    StaticMutexAutoLock keyLocker(gKeyedAccumulationsMutex);
     if (gKeyedAccumulations) {
       keyedAccumulationsToSend.SwapElements(*gKeyedAccumulations);
     }
   }

One effect of this is that |gAccumulations| and |gKeyedAccumulations| are
no longer updated as an atomic unit.  Does that matter?
Flags: needinfo?(jseward)
(In reply to Julian Seward [:jseward] from comment #13)
> (In reply to Chris H-C :chutten from comment #12)
> > [1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=846471eec293
> 
> My main concern here is the possibility of introducing new deadlocks.
> Avoiding them was a major source of difficulty in bug 1258183,and that
> is why bug 1258183 does not fully de-race Telemetry.
> 
> With your proposed patch in comment #12, is there any possible call
> path from |internal_RemoteAccumulate(ID, uint32_t)| back to itself,
> either on the same or a different thread?  And the same question for
> |internal_RemoteAccumulate(ID, const nsCString&, uint32_t)?

internal_RemoteAccumulate will not call back into itself so long as armIPCTimer continues not to accumulate its own telemetry.

> I would also feel more comfortable if you could try this with TSan, so
> as to validate the theory that these crashes are caused by new races,
> and to validate the proposed fix.

"try this with TSan"... is that as simple as following the instructions [1]?

> <snip>
> One effect of this is that |gAccumulations| and |gKeyedAccumulations| are
> no longer updated as an atomic unit.  Does that matter?

Nope. gAccumulations and gKeyedAccumulations are independent except where they just happen to share the same batching timer, timeout, and size limit. Allowing them to be simultaneously accumulated to is an intended benefit.

[1]: https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Thread_Sanitizer
Flags: needinfo?(jseward)
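As a rough guide to the TSan instructions linked above, a ThreadSanitizer build is driven by mozconfig options along these lines (the exact set is an assumption from the general TSan build docs; consult the linked page for the current requirements):

```
# Hypothetical mozconfig fragment for a TSan build; in particular,
# jemalloc is typically disabled because TSan needs its own allocator.
ac_add_options --enable-thread-sanitizer
ac_add_options --disable-jemalloc
ac_add_options --enable-debug
```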
Comment on attachment 8797216 [details]
bug 1304519 - Ensure we hold a lock before accumulating.

https://reviewboard.mozilla.org/r/82820/#review81752

Why are we adding more locks rather than locking around the JS histogram calls to `internal_Accumulate*` functions?  The whole point of `internal_*` functions is that they already hold the lock, and we're violating that invariant from the JS histograms.  Maybe we're violating it from other places, too...

I think we need better locking discipline, not more locks.
Attachment #8797216 - Flags: review?(nfroyd)
Apparently we have a bug for precisely that.
Depends on: 1278821
Flags: needinfo?(jseward)
No longer depends on: 1278821
Depends on: 1278821
Assignee: nobody → chutten
[Tracking Requested - why for this release]:
This is a suspected race leading to a crash or other artifacts.
It is a likely regression from bug 1218576.
Priority: P2 → P1
(In reply to Chris H-C :chutten from comment #14)
> "try this with TSan"... is that as simple as following the instructions [1]?
> [1]:
> https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Thread_Sanitizer

Yes, those look like they would work.  I can also test patches if that helps.
Julian, I think bug 1278821 (JSHistograms_* & JSKeyedHistograms_*) was the only racy code path left for TelemetryHistogram.cpp.
Do you agree?
Flags: needinfo?(jseward)
I've put a patch up for review which doesn't necessarily fix _all_ of the racy stuff (which is why I'm posting it here instead of against bug 1278821), but it _does_ quiet TSan's complaints of a data race on the very nsTArray we're worried about.

My testing was running dom/media/tests/mochitest/test_peerConnection_* tee'd to an output file which I pawed through for data races.
Comment on attachment 8797216 [details]
bug 1304519 - Ensure we hold a lock before accumulating.

https://reviewboard.mozilla.org/r/82820/#review82196

::: toolkit/components/telemetry/TelemetryHistogram.cpp:1512
(Diff revision 2)
>    Histogram::ClassType type = h->histogram_type();
>  
>    JS::CallArgs args = CallArgsFromVp(argc, vp);
>  
>    if (!internal_CanRecordBase()) {
>      return true;

I agree that we can just start with `JS(Keyed)Histogram_Add()` here, but we should at least protect the other `internal_*` calls and the `gHistograms` access in this function while we're here.

::: toolkit/components/telemetry/TelemetryHistogram.cpp:1785
(Diff revision 2)
>    if (!args[0].isString() || !key.init(cx, args[0])) {
>      JS_ReportErrorASCII(cx, "Not a string");
>      return false;
>    }
>  
>    const uint32_t type = keyed->GetHistogramType();

Ditto here for the `keyed` access.
Attachment #8797216 - Flags: review?(gfritzsche)
Comment on attachment 8797216 [details]
bug 1304519 - Ensure we hold a lock before accumulating.

Catching up on this per IRC, this is ok for me as a start.
This basically restores the raciness status quo from before bug 1218576, and is easy & quick to land to isolate the fallout here.

Bug 1278821 can handle the rest.
Attachment #8797216 - Flags: review+
Comment on attachment 8797216 [details]
bug 1304519 - Ensure we hold a lock before accumulating.

https://reviewboard.mozilla.org/r/82820/#review82208
Attachment #8797216 - Flags: review+
Pushed by chutten@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ee7990c78dd9
Ensure we hold a lock before accumulating. r=gfritzsche
Tracking 52+ for this crash.
https://hg.mozilla.org/mozilla-central/rev/ee7990c78dd9
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Duplicate of this bug: 1304009
Duplicate of this bug: 1292429
Duplicate of this bug: 1304580
Duplicate of this bug: 1304935
Duplicate of this bug: 1304940
Duplicate of this bug: 1306803
Duplicate of this bug: 1307469
Version: unspecified → Trunk
Flags: needinfo?(jseward)