Open Bug 1458221 Opened 3 years ago Updated 2 years ago

Crash in [@ OOM | small] with mozilla::TelemetryIPCAccumulator::AccumulateChildHistogram spiking in ru locales

Categories

(Core :: General, defect, P3)

x86
Windows
defect

Tracking

()

Tracking Status
firefox59 --- affected
firefox60 --- affected
firefox61 --- affected

People

(Reporter: philipp, Unassigned)

Details

(Keywords: crash, stalled)

Crash Data

This bug was filed from the Socorro interface and is
report bp-9570834d-f641-4700-bb4c-809c00180429.
=============================================================

Top 10 frames of crashing thread:

0 mozglue.dll mozalloc_abort memory/mozalloc/mozalloc_abort.cpp:33
1 mozglue.dll mozalloc_handle_oom memory/mozalloc/mozalloc_oom.cpp:54
2 mozglue.dll moz_xrealloc memory/mozalloc/mozalloc.cpp:95
3 xul.dll nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_CopyWithMemutils>::EnsureCapacity<nsTArrayInfallibleAllocator> xpcom/ds/nsTArray-inl.h:183
4 xul.dll nsTArray_Impl<gfxFontFeature, nsTArrayInfallibleAllocator>::AppendElement<gfxFontFeature&, nsTArrayInfallibleAllocator> xpcom/ds/nsTArray.h:2188
5 xul.dll mozilla::TelemetryIPCAccumulator::AccumulateChildHistogram toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:153
6 xul.dll `anonymous namespace'::internal_Accumulate toolkit/components/telemetry/TelemetryHistogram.cpp:998
7 xul.dll TelemetryHistogram::Accumulate toolkit/components/telemetry/TelemetryHistogram.cpp:1937
8 xul.dll mozilla::PaintTelemetry::AutoRecordPaint::~AutoRecordPaint layout/painting/nsDisplayList.cpp:10057
9 xul.dll nsRefreshDriver::Tick layout/base/nsRefreshDriver.cpp:2047

=============================================================

there has been a spike in oom|small content crashes over the last couple of days, coming from 32-bit windows users of firefox in ru builds and involving telemetry code:
https://crash-stats.mozilla.com/signature/?useragent_locale=ru&platform=Windows&proto_signature=~mozilla%3A%3ATelemetryIPCAccumulator&signature=OOM%20%7C%20small&date=%3E%3D2018-04-01#graphs

oom allocation size is 2,048 bytes most of the time. a couple of user comments refer to tab crashes while playing a game. some mentioned this one ("candy valley") in particular: https://vk.com/app4523773?from_install=1&loc=apps
It appears that quite a few of the URLs are from this Russian game site: https://ok.ru/game/. I see URLs for all different games:

* https://ok.ru/game/gardengame
* https://ok.ru/game/vegamix

When I scanned the list, I was hard pressed to find a URL that wasn't from that particular site.
AutoRecordPaint records to four histograms every time it is destroyed. It is only used in one place[1], when the view manager has a pending flush. This can happen in a variety of places (including within the refresh driver tick itself[2]). However, I don't think that matters since the allocation size is so small.

In each content process, the TelemetryIPCAccumulator collects arrays of histogram accumulations and other data that need to be sent to the parent process (where the accumulations actually happen). These arrays are flushed either when they reach a high water mark in size, or after 2s.
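
To make the batching described above concrete, here is a minimal standalone sketch. It is not the actual TelemetryIPCAccumulator code: `PendingSample`, `BatchingAccumulator` and the flush callback are hypothetical names, it uses `std::vector` rather than `nsTArray`, and it checks the 2s deadline on each call instead of arming a timer. The caller passes the time in so the behaviour is deterministic.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for one pending accumulation (not the real struct).
struct PendingSample {
  uint32_t histogramId;
  uint32_t sample;
};

// Sketch: queue samples in the child, hand them off in batches.
class BatchingAccumulator {
 public:
  using Clock = std::chrono::steady_clock;
  using FlushFn = std::function<void(std::vector<PendingSample>&&)>;

  BatchingAccumulator(size_t highWaterMark, std::chrono::milliseconds maxDelay,
                      FlushFn flush)
      : mHighWaterMark(highWaterMark),
        mMaxDelay(maxDelay),
        mFlush(std::move(flush)) {}

  // Queue one sample. Flush when the batch is full or has been pending
  // for at least mMaxDelay.
  void Accumulate(uint32_t id, uint32_t sample, Clock::time_point now) {
    if (mPending.empty()) {
      mBatchStart = now;  // first sample of a new batch starts the clock
    }
    // Growing the array is where a small allocation can fail under
    // memory pressure, as in the stack from comment 0.
    mPending.push_back({id, sample});
    if (mPending.size() >= mHighWaterMark || now - mBatchStart >= mMaxDelay) {
      Flush();
    }
  }

  // Hand the whole batch to the callback and start over with an empty array.
  void Flush() {
    if (mPending.empty()) {
      return;
    }
    std::vector<PendingSample> batch;
    batch.swap(mPending);
    mFlush(std::move(batch));
  }

  size_t PendingCount() const { return mPending.size(); }

 private:
  size_t mHighWaterMark;
  std::chrono::milliseconds mMaxDelay;
  FlushFn mFlush;
  std::vector<PendingSample> mPending;
  Clock::time_point mBatchStart{};
};
```

The point of the sketch is only that every accumulation appends to a growing array, so whichever caller happens to trigger the next reallocation is the one that shows up in the crash signature.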

An allocation of 2048 bytes is well within what we assumed to be normal operation. The high water mark for histograms is at 5k elements (and we'll keep recording up to 5x as many accumulations before truncating), and each accumulation struct is 64 bytes in size (so the 2048B allocation means an array of size 32).
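
For reference, the arithmetic behind those numbers, as a small compile-time check. The constant and function names are made up for illustration, the 64-byte size is taken from the comment above rather than measured, "5x as many" is read as 5 x 5k elements, and any array header overhead is ignored.

```cpp
#include <cstddef>

// Figures quoted in this bug; the names are hypothetical.
constexpr size_t kAccumulationSize = 64;  // bytes per accumulation struct
constexpr size_t kHighWaterMark = 5000;   // elements before a forced flush
constexpr size_t kTruncationFactor = 5;   // assumed reading of "5x as many"

// How many accumulation structs fit in an allocation of `bytes`.
constexpr size_t ElementsThatFit(size_t bytes) {
  return bytes / kAccumulationSize;
}

// How many bytes an array of `count` accumulation structs needs.
constexpr size_t BytesFor(size_t count) {
  return count * kAccumulationSize;
}

// The failing 2048-byte allocation holds only 32 elements.
static_assert(ElementsThatFit(2048) == 32, "2048B / 64B = 32 elements");
// At the high water mark the array would be about 320 KB.
static_assert(BytesFor(kHighWaterMark) == 320000, "5000 * 64B");
// At the assumed truncation point it would be about 1.6 MB.
static_assert(BytesFor(kHighWaterMark * kTruncationFactor) == 1600000,
              "25000 * 64B");
```

So the array in the crashing reports is tiny compared with what the accumulator is designed to hold, which fits the reading that the failure is about overall memory pressure rather than the array itself.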

Is this just a case of memory pressure and we happen to be the unlucky one allocating at this crucial moment?

[1]: https://searchfox.org/mozilla-central/rev/8837610b6c999451435695e800f38d4acbc0a644/layout/base/nsRefreshDriver.cpp#2066
[2]: https://searchfox.org/mozilla-central/rev/8837610b6c999451435695e800f38d4acbc0a644/layout/base/nsRefreshDriver.cpp#2104
yes, most of the reports seem to show "System memory use percentage" in the 80s & 90s. curiously the report from comment #0 is at 63% though and therefore probably not under particular memory pressure...
P2 for visibility. I don't think there's much we can do here, as it appears we're just unlucky to be holding the hot potato.

As for the 63%, a recent conversation on the stability list[1] highlights that we can be killed due to OOM on Windows by running out of Commit, not just used bytes. I'm not sure how likely this is to be the case here, but "memory" is difficult to count :S

[1]: https://mail.mozilla.org/private/stability/2018-May/002226.html (may require being a list member)
Priority: -- → P2
From our understanding, we are not causing the problem, we just end up being blamed due to the allocation timing.
Component: Telemetry → General
Priority: P2 → --
Product: Toolkit → Core

Should we dupe this over to some sort of generic OOM|small bucket?

Flags: needinfo?(madperson)
Priority: -- → P3

i'm not aware that we have a generic or meta [@ OOM|small] bug. if there's not enough information to progress in this bug, i'm fine with marking it as stalled or resolving as wontfix though.

Flags: needinfo?(madperson)
Keywords: stalled