Closed
Bug 1304519
Opened 8 years ago
Closed 8 years ago
Crash in jemalloc_crash | arena_ralloc | moz_xrealloc | nsTArray_base<T>::EnsureCapacity<T> | nsTArray_Impl<T>::AppendElement<T> | `anonymous namespace''::internal_Accumulate
Categories
(Toolkit :: Telemetry, defect, P1)
Toolkit
Telemetry
Tracking
RESOLVED
FIXED
mozilla52
| Release | Tracking | Status |
|---|---|---|
| firefox51 | --- | unaffected |
| firefox52 | + | fixed |
People
(Reporter: jchen, Assigned: chutten)
References
Details
(Keywords: crash, regression, Whiteboard: [measurement:client])
Crash Data
Attachments
(1 file)
This bug was filed from the Socorro interface and is report bp-a79d5746-58ca-454c-8b19-908362160921.
=============================================================
Infrequent crashes that started with the 9-20 Nightly (currently the #26 and #30 top crashes on Windows for that Nightly). Based on the stack, I think it's a regression from bug 1218576.
Flags: needinfo?(chutten)
Comment 1•8 years ago
This is a content process crash; presumably we are crashing when appending to the gAccumulations buffer. I can't quite tell from the stack whether this is an OOM crash or a different issue.
> 0 mozglue.dll jemalloc_crash memory/mozjemalloc/jemalloc.c:1631
> 1 mozglue.dll arena_ralloc memory/mozjemalloc/jemalloc.c:4929
> 2 mozglue.dll moz_xrealloc memory/mozalloc/mozalloc.cpp:105
> 3 xul.dll nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_CopyWithMemutils>::EnsureCapacity<nsTArrayInfallibleAllocator>(unsigned __int64, unsigned __int64) obj-firefox/dist/include/nsTArray-inl.h:183
> 4 xul.dll nsTArray_Impl<nsString::Segment, nsTArrayInfallibleAllocator>::AppendElement<nsString::Segment, nsTArrayInfallibleAllocator>(nsString::Segment&&) obj-firefox/dist/include/nsTArray.h:2073
> 5 xul.dll `anonymous namespace'::internal_Accumulate toolkit/components/telemetry/TelemetryHistogram.cpp:1357
Priority: -- → P2
Whiteboard: [measurement:client]
Assignee
Comment 2•8 years ago
Is it possible that my use of nsTArray is flawed in some odd way? The refresh driver [1] does things the same way, at least as far as I can tell.

+froydnj: Could you take a second look at the AppendElements in RemoteAccumulate and SwapElements in IPCTimerFired (in TelemetryHistogram.cpp) to make sure I'm not footgunning in some semi-obvious way?

I really have no idea what could be going on here unless the main thread is too crushed to fire the IPC timer... how many accumulations would there have to be in order to run out of memory? They're only two words each.

[1]: https://dxr.mozilla.org/mozilla-central/rev/560b2c805bf7bebeb3ceebc495a81b2aa4c0c755/layout/base/nsRefreshDriver.cpp#2200
Flags: needinfo?(chutten) → needinfo?(nfroyd)
Comment 3•8 years ago
OOM seems really unlikely, at least for the crash linked in comment 0; we're running on a 64-bit machine and there's 8TB of virtual memory free.

The jemalloc stack also indicates that we're reallocating a block of memory (ralloc), and the actual code indicates that we've successfully allocated a new block and are crashing in the process of deallocating the old one. I think that means memory corruption of some sort, though how that's happening in this instance, I'm not sure.
Flags: needinfo?(nfroyd)
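To make the reasoning in that comment concrete: a growing realloc conceptually does three steps, and the crash stack points at the last of them. A minimal sketch (illustrative only, not the actual arena_ralloc code):

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Conceptual sketch of what a growing realloc does: allocate a new block,
// copy the contents, then free the old block. The stack in comment 0
// crashes in the equivalent of the final step.
void* grow(void* old, size_t oldSize, size_t newSize) {
  void* fresh = std::malloc(newSize);  // step 1: a true OOM would fail here
  if (!fresh) return nullptr;
  std::memcpy(fresh, old, oldSize);    // step 2: copy the existing elements
  std::free(old);                      // step 3: crash site if the heap
                                       //   metadata for `old` is corrupted
  return fresh;
}
```

If allocation were failing (OOM), the failure would surface at step 1; crashing while deallocating the old block instead suggests the allocator's bookkeeping for that block was already damaged, which is why memory corruption is the likelier explanation.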
Assignee
Comment 4•8 years ago
We're seeing what could very well be memory corruption in bug 1304009 as well. Is there anything that springs to mind that could be a potential cause?
Updated•8 years ago
Crash Signature: nsTArray_base<T>::EnsureCapacity<T> | nsTArray_Impl<T>::AppendElement<T> | `anonymous namespace''::internal_RemoteAccumulate] → nsTArray_base<T>::EnsureCapacity<T> | nsTArray_Impl<T>::AppendElement<T> | `anonymous namespace''::internal_RemoteAccumulate]
[@ jemalloc_crash | je_realloc | moz_xrealloc | nsTArray_base<T>::EnsureCapacity<T> | nsTArray_Impl<T>::AppendElement<T> | inte…
Comment 6•8 years ago
(In reply to Chris H-C :chutten from comment #4)
> We're seeing what could very well be memory corruption in bug 1304009 as
> well. Is there anything that springs to mind that could be a potential
> cause?

Have all the telemetry races been cleaned up? Multiple threads accessing the same array could be bad. I suggest this because all 4 of the crashes with the signature in bug 1306544 are calling into telemetry from off the main thread.
Assignee
Comment 7•8 years ago
Hrm. Several of the reports are coming in over JSHistogram_Add, which is specifically exempt from the mutex[1]. But there are reports coming in that aren't on that path... maybe we're just seeing the fallout instead of the fault. Hrm.

ref. bug 1258183

ni? jseward: could these weird things be easily worked around (say, with a gAccumulationsMutex)?

[1]: https://dxr.mozilla.org/mozilla-central/rev/9baec74b3db1bf005c66ae2f50bafbdb02c3be38/toolkit/components/telemetry/TelemetryHistogram.cpp#1454
Flags: needinfo?(jseward)
Comment 8•8 years ago
(In reply to Chris H-C :chutten from comment #7)
> Hrm. Several of the reports are coming in over JSHistogram_Add, which is
> specifically exempt from the mutex[1].
>
> But there are reports coming in that aren't on that path... maybe we just
> are seeing the fallout instead of the fault.

JSHistogram_Add should always be called on the main thread. But it's certainly possible that it's accumulating into a histogram that's modified on multiple threads. The accumulation itself should be guarded, I think, even if all of the JS stuff doesn't need to be.

(One more argument for MutexAutoLock& parameters rather than function naming conventions...)
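The parenthetical above refers to a known idiom: instead of encoding "caller must hold the lock" in a naming convention like internal_*, the function can require a reference to the live lock guard, so the compiler enforces the precondition. A minimal sketch using standard C++ in place of Mozilla's StaticMutexAutoLock (all names here are illustrative, not from the actual patch):

```cpp
#include <cassert>
#include <mutex>
#include <vector>

std::mutex gHistogramMutex;   // illustrative stand-in for a StaticMutex
std::vector<int> gSamples;    // illustrative shared state

// Taking the guard by reference means a caller must have constructed one,
// i.e. must hold a lock -- the precondition lives in the signature rather
// than in a function-naming convention.
void AccumulateLocked(const std::lock_guard<std::mutex>&, int sample) {
  gSamples.push_back(sample);
}

void Accumulate(int sample) {
  std::lock_guard<std::mutex> lock(gHistogramMutex);
  AccumulateLocked(lock, sample);  // the guard is the proof-of-lock
}
```

This doesn't prove the *right* mutex is held, but it makes "forgot to lock at all" a compile error instead of a data race.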
Comment 9•8 years ago
FWIW, we can have multiple DOM workers, so we could be accumulating worker telemetry from multiple threads.
Comment hidden (mozreview-request)
Assignee
Comment 12•8 years ago
I guess the least invasive approach for the moment would be to lock access and modification to the g(Keyed)Accumulations arrays.

The patch for review has passed a short try run [1] and is a simple, straightforward patch.

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=846471eec293
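The approach described here, taking a lock around every access to the accumulation arrays, can be sketched with standard C++ in place of nsTArray/StaticMutex (the names mirror the bug, but the code is an illustrative stand-in, not the actual patch):

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Simplified stand-in for the Accumulation struct ("only two words",
// per comment 2).
struct Accumulation {
  uint32_t id;
  uint32_t sample;
};

std::mutex gAccumulationsMutex;            // stand-in for a StaticMutex
std::vector<Accumulation> gAccumulations;  // stand-in for the nsTArray

// Every append takes the lock, so concurrent accumulations (e.g. from
// DOM workers and the main thread) cannot race on the array's buffer.
void RemoteAccumulate(uint32_t id, uint32_t sample) {
  std::lock_guard<std::mutex> lock(gAccumulationsMutex);
  gAccumulations.push_back({id, sample});
}

size_t AccumulateFromManyThreads(int nThreads, int perThread) {
  std::vector<std::thread> threads;
  for (int t = 0; t < nThreads; ++t) {
    threads.emplace_back([=] {
      for (int i = 0; i < perThread; ++i) {
        RemoteAccumulate(static_cast<uint32_t>(t), static_cast<uint32_t>(i));
      }
    });
  }
  for (auto& th : threads) th.join();
  std::lock_guard<std::mutex> lock(gAccumulationsMutex);
  return gAccumulations.size();
}
```

Without the lock_guard in RemoteAccumulate, concurrent push_back calls can both trigger a reallocation of the same buffer, which is exactly the kind of heap corruption the crash stacks suggest.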
Comment 13•8 years ago
(In reply to Chris H-C :chutten from comment #12)
> [1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=846471eec293

My main concern here is the possibility of introducing new deadlocks. Avoiding them was a major source of difficulty in bug 1258183, and that is why bug 1258183 does not fully de-race Telemetry.

With your proposed patch in comment #12, is there any possible call path from |internal_RemoteAccumulate(ID, uint32_t)| back to itself, either on the same or a different thread? And the same question for |internal_RemoteAccumulate(ID, const nsCString&, uint32_t)|?

I would also feel more comfortable if you could try this with TSan, so as to validate the theory that these crashes are caused by new races, and to validate the proposed fix.

  void
  TelemetryHistogram::IPCTimerFired(nsITimer* aTimer, void* aClosure)
  {
    MOZ_ASSERT(NS_IsMainThread());
    nsTArray<Accumulation> accumulationsToSend;
    nsTArray<KeyedAccumulation> keyedAccumulationsToSend;
    {
  -   StaticMutexAutoLock locker(gTelemetryHistogramMutex);
  +   StaticMutexAutoLock locker(gAccumulationsMutex);
      if (gAccumulations) {
        accumulationsToSend.SwapElements(*gAccumulations);
      }
  +   StaticMutexAutoLock keyLocker(gKeyedAccumulationsMutex);
      if (gKeyedAccumulations) {
        keyedAccumulationsToSend.SwapElements(*gKeyedAccumulations);
      }
    }

One effect of this is that |gAccumulations| and |gKeyedAccumulations| are no longer updated as an atomic unit. Does that matter?
Flags: needinfo?(jseward)
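The IPCTimerFired code quoted in comment 13 uses a swap-under-lock drain: swap the shared array with an empty local one while the lock is held, then do the slow work (the IPC send) after releasing it. A generic sketch of the same pattern, with std::vector/std::mutex standing in for nsTArray/StaticMutex (illustrative names, not the actual patch):

```cpp
#include <cassert>
#include <mutex>
#include <vector>

std::mutex gAccumulationsMutex;
std::vector<int> gAccumulations;  // shared buffer, appended to under the lock

// Swap the shared buffer with an empty local one while holding the lock,
// then return it so the caller can process it lock-free. Appenders are
// blocked only for an O(1) pointer swap (analogous to SwapElements), not
// for the duration of the send.
std::vector<int> DrainForSend() {
  std::vector<int> toSend;
  {
    std::lock_guard<std::mutex> lock(gAccumulationsMutex);
    toSend.swap(gAccumulations);
  }
  return toSend;  // caller sends this without holding the lock
}
```

The design benefit is that the expensive step never runs with the lock held, which is one reason deadlock analysis here reduces to the short critical sections around the swaps.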
Assignee
Comment 14•8 years ago
(In reply to Julian Seward [:jseward] from comment #13)
> My main concern here is the possibility of introducing new deadlocks.
> Avoiding them was a major source of difficulty in bug 1258183, and that
> is why bug 1258183 does not fully de-race Telemetry.
>
> With your proposed patch in comment #12, is there any possible call
> path from |internal_RemoteAccumulate(ID, uint32_t)| back to itself,
> either on the same or a different thread? And the same question for
> |internal_RemoteAccumulate(ID, const nsCString&, uint32_t)|?

internal_RemoteAccumulate will not call back into itself so long as armIPCTimer continues not accumulating its own telemetry.

> I would also feel more comfortable if you could try this with TSan, so
> as to validate the theory that these crashes are caused by new races,
> and to validate the proposed fix.

"Try this with TSan"... is that as simple as following the instructions [1]?

> <snip>
> One effect of this is that |gAccumulations| and |gKeyedAccumulations| are
> no longer updated as an atomic unit. Does that matter?

Nope. gAccumulations and gKeyedAccumulations are independent except where they just happen to share the same batching timer, timeout, and size limit. Allowing them to be simultaneously accumulated to is an intended benefit.

[1]: https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Thread_Sanitizer
Flags: needinfo?(jseward)
Comment 15•8 years ago
mozreview-review
Comment on attachment 8797216 [details]
bug 1304519 - Ensure we hold a lock before accumulating.

https://reviewboard.mozilla.org/r/82820/#review81752

Why are we adding more locks rather than locking around the JS histogram calls to `internal_Accumulate*` functions? The whole point of `internal_*` functions is that they already hold the lock, and we're violating that invariant from the JS histograms. Maybe we're violating it from other places, too... I think we need better locking discipline, not more locks.
Attachment #8797216 -
Flags: review?(nfroyd)
Assignee
Comment 16•8 years ago
Apparently we have a bug for precisely that.
Depends on: 1278821
Flags: needinfo?(jseward)
Updated•8 years ago
Assignee: nobody → chutten
Comment 17•8 years ago
[Tracking Requested - why for this release]: This is a suspected race leading to a crash or other artifacts. It is a likely regression from bug 1218576.
Updated•8 years ago
Priority: P2 → P1
Comment 18•8 years ago
(In reply to Chris H-C :chutten from comment #14)
> "try this with TSan"... is that as simple as following the instructions [1]?
> [1]: https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Thread_Sanitizer

Yes, those look like they would work. I can also test patches if that helps.
Comment 19•8 years ago
Julian, I think bug 1278821 (JSHistograms_* & JSKeyedHistograms_*) was the only racy code path left in TelemetryHistogram.cpp. Do you agree?
Flags: needinfo?(jseward)
Comment hidden (mozreview-request)
Assignee
Comment 21•8 years ago
I've put a patch up for review which doesn't necessarily fix _all_ of the racy stuff (which is why I'm posting it here instead of against bug 1278821), but it _does_ quiet TSan's complaints of a data race on the very nsTArray we're worried about.

My testing was running dom/media/tests/mochitest/test_peerConnection_* tee'd to an output file, which I pawed through for data races.
Comment 22•8 years ago
mozreview-review
Comment on attachment 8797216 [details]
bug 1304519 - Ensure we hold a lock before accumulating.

https://reviewboard.mozilla.org/r/82820/#review82196

::: toolkit/components/telemetry/TelemetryHistogram.cpp:1512
(Diff revision 2)
>   Histogram::ClassType type = h->histogram_type();
>
>   JS::CallArgs args = CallArgsFromVp(argc, vp);
>
>   if (!internal_CanRecordBase()) {
>     return true;

While I agree that we can just start with `JS(Keyed)Histogram_Add()` here, we should at least protect the other `internal_*` calls and the `gHistograms` access in this function while we're here.

::: toolkit/components/telemetry/TelemetryHistogram.cpp:1785
(Diff revision 2)
>   if (!args[0].isString() || !key.init(cx, args[0])) {
>     JS_ReportErrorASCII(cx, "Not a string");
>     return false;
>   }
>
>   const uint32_t type = keyed->GetHistogramType();

Ditto here for the `keyed` access.
Attachment #8797216 -
Flags: review?(gfritzsche)
Comment 23•8 years ago
Comment on attachment 8797216 [details]
bug 1304519 - Ensure we hold a lock before accumulating.

Catching up on this per IRC, this is OK with me as a start. It basically restores the raciness status from before bug 1218576 and is easy & quick to land to isolate the fallout here. Bug 1278821 can handle the rest.
Attachment #8797216 -
Flags: review+
Assignee
Comment 24•8 years ago
mozreview-review
Comment on attachment 8797216 [details]
bug 1304519 - Ensure we hold a lock before accumulating.

https://reviewboard.mozilla.org/r/82820/#review82208
Attachment #8797216 -
Flags: review+
Comment 25•8 years ago
Pushed by chutten@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ee7990c78dd9
Ensure we hold a lock before accumulating. r=gfritzsche
Comment 27•8 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/ee7990c78dd9
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Updated•7 years ago
Version: unspecified → Trunk
Updated•7 years ago
Flags: needinfo?(jseward)