Closed Bug 1428612 Opened 6 years ago Closed 6 years ago

Crash in OOM | small in TelemetryHistogram::InitializeGlobalState

Categories

(Toolkit :: Telemetry, defect, P1)

56 Branch
All
Windows
defect

Tracking

()

RESOLVED FIXED
mozilla60
Tracking Status
firefox-esr52 --- unaffected
firefox57 --- wontfix
firefox58 - wontfix
firefox59 --- fixed
firefox60 --- fixed

People

(Reporter: philipp, Assigned: erahm)

Details

(Keywords: crash)

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is
report bp-b64045fe-3046-426d-ae81-59cbc0180106.
=============================================================

Top 10 frames of crashing thread:

0 xul.dll NS_ABORT_OOM xpcom/base/nsDebugImpl.cpp:620
1 xul.dll nsTHashtable<nsBaseHashtableET<nsCStringHashKey, mozilla::Telemetry::HistogramID> >::s_InitEntry xpcom/ds/nsTHashtable.h:456
2 xul.dll PLDHashTable::Add xpcom/ds/PLDHashTable.cpp:590
3 xul.dll TelemetryHistogram::InitializeGlobalState toolkit/components/telemetry/TelemetryHistogram.cpp:1719
4 xul.dll `anonymous namespace'::TelemetryImpl::CreateTelemetryInstance toolkit/components/telemetry/Telemetry.cpp:1202
5 xul.dll `anonymous namespace'::nsITelemetryConstructor toolkit/components/telemetry/Telemetry.cpp:1572
6 xul.dll nsComponentManagerImpl::CreateInstanceByContractID xpcom/components/nsComponentManager.cpp:1086
7 xul.dll nsComponentManagerImpl::GetServiceByContractID xpcom/components/nsComponentManager.cpp:1446
8 xul.dll nsCOMPtr_base::assign_from_gs_contractid xpcom/base/nsCOMPtr.cpp:95
9 xul.dll mozilla::Telemetry::Init toolkit/components/telemetry/Telemetry.cpp:1968

=============================================================

these seem to be persistent crashes on startup that hit individual users repeatedly.
some user comments: http://bit.ly/2CQzwxm
Flags: needinfo?(chutten)
Flags: needinfo?(alessio.placitelli)
Priority: -- → P1
To my knowledge no telemetry patches made it into 57.0.3 (which seems the most likely vector for the change that introduced this crash)

Frame 3 is where gNameToHistogramIDMap is being populated with string hashes to histogram ids. It is limited by HistogramCount which can be at most std::numeric_limits<uint32_t>::max() (4294967295) but is actually on the order of about 1700[1]. Not sure how that would OOM unless something subtle was at play.

[1]: https://searchfox.org/mozilla-central/source/__GENERATED__/toolkit/components/telemetry/TelemetryHistogramEnums.h#1863
Flags: needinfo?(chutten)
Since the signature isn't too helpful in this case, here's some links:

Here's a report of it happening: https://crash-stats.mozilla.com/report/index/7f4cd699-a200-4a18-afc7-5c06b0180103

Here's a different report where we OOM a little later when init'ing TelemetryEvent: https://crash-stats.mozilla.com/report/index/6f2a3528-3d91-44f7-9045-f997e0180105

I'm seeing only Windows crashes for this so far.
The volume seems to become high on Dec 27:
https://crash-stats.mozilla.com/signature/?signature=OOM%20%7C%20small&date=%3E%3D2017-10-08T15%3A55%3A17.000Z&date=%3C2018-01-08T14%3A55%3A17.000Z#graphs

However, we deleted crash data from Dec 24 due to another bug. Dec 24 + boxing day(s) should see low usage volume.
The reports from comment 1 are a bit surprising to me (bp-7f4cd699-a200-4a18-afc7-5c06b0180103 and bp-6f2a3528-3d91-44f7-9045-f997e0180105):
- both request tiny allocations, 6 bytes & 23 bytes
- both have physical memory available, 2.33GB & 2.66GB
(In reply to Georg Fritzsche [:gfritzsche] from comment #4)
> The reports from comment 1 are a bit surprising to me
> (bp-7f4cd699-a200-4a18-afc7-5c06b0180103 and
> bp-6f2a3528-3d91-44f7-9045-f997e0180105):
> - both request tiny allocations, 6 bytes & 23 bytes
> - both have physical memory available, 2.33GB & 2.66GB

Is there *any* way that the amount you're requesting can overflow?  Because that sounds like what's happening here, especially on 64-bit.
Flags: needinfo?(alessio.placitelli)
We're out of our depth here. :jimm, do you have any ideas? The tl;dr is that we're seeing an OOM|small startup crash within Telemetry init despite the report saying there's plenty of available mem. It happens in various places we're allocating memory.
Flags: needinfo?(jmathies)
https://crash-stats.mozilla.com/search/?proto_signature=~TelemetryHistogram%3A%3AInitializeGlobalState&uptime=%3C10&product=Firefox&date=%3E%3D2018-01-10T11%3A55%3A11.000Z&date=%3C2018-01-17T11%3A55%3A11.000Z&_sort=-date&_facets=signature&_facets=version&_facets=total_virtual_memory&_facets=available_virtual_memory&_facets=available_physical_memory&_facets=memory_vsize_max_contiguous&_facets=system_memory_use_percentage&_facets=oom_allocation_size&_facets=memory_heap_allocated&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

* affects 55 and up so this isn't 57 specific
* memory parameter facets don't point to anything obvious
* affected systems don't appear to be experiencing memory stress.
* I'm unable to get virtual memory max contiguous data out of these reports. :/
* uptime generally is less than 10 seconds for these crashes

volume for these crashes is really low, currently 95 reports in 57 across 7 days.

Eric, any thoughts on what could cause these rather unexpected OOMs here?
Flags: needinfo?(jmathies) → needinfo?(erahm)
The stack is a little weird, I'm assuming we're OOMing in the string allocation [1]. OOMing that small that early is *highly* unusual, it could be a sign of corruption which would go along with my next comment.

Some of the other crashes that are in the `nsDependentCString` constructor feel like a corruption issue, they're crashing due to EXCEPTION_ACCESS_VIOLATION_READ, eg:

69bb96ff-49e7-4471-a34a-408310180116
0eb9cd62-4ee2-46c4-bd07-8bf1e0180116
b2ead80b-f587-47cc-836a-bb4b10180116

That would mean either the string table pointer is junk or the offset is junk [2]. One of the addresses had `fee` in it which could mean we're poking memory that was `HeapFree`'d.

There are also a few OOM small instances that are quite large when allocating the gigantic backing table [3]:

d97b0c06-88ad-4a3c-8d4a-8387a0180115
f2ba8008-665f-466b-8689-683fc0180115

I guess that's not super helpful but there are some improvements that can be made:

1) Split up that ginormous array
2) `gNameToHistogramIDMap` actually contains heap allocated strings, contrary to comments claiming it doesn't. We could switch to `nsDepCharHashKey` instead, but that *could* have perf implications. We could also add a `nsDependentCStringHashKey` that doesn't copy, but I think we have to fix our string classes for that to work.
3) We could add an assert that that address returned for name is w/in gHistogramStringTable & gHistogramStringTable + sizeof(
gHistogramStringTable)

[1] https://searchfox.org/mozilla-central/rev/48cbb200aa027a0a379b6004b6196a167344b865/xpcom/string/nsTSubstring.cpp#441
[2] https://searchfox.org/mozilla-central/rev/1f9b4f0f0653b129b4db1b9fc5e9068054afaac0/toolkit/components/telemetry/TelemetryHistogram.cpp#439
[3] https://searchfox.org/mozilla-central/rev/1f9b4f0f0653b129b4db1b9fc5e9068054afaac0/toolkit/components/telemetry/TelemetryHistogram.cpp#1729-1730
Flags: needinfo?(erahm)
This switches from `nsDependentCString` to `nsLiteralCString` when adding
histograms to the table. It will help save some memory and should avoid the OOM
smalls we have been seeing.
Attachment #8943459 - Flags: review?(gfritzsche)
Assignee: nobody → erahm
Status: NEW → ASSIGNED
This adds a dignostic assertion that the name pointer is actually within the
static name table.
Attachment #8943460 - Flags: review?(gfritzsche)
Suggestions 2 and 3 were pretty easy so I went ahead and took care of them.
From comment 9 it sounds like the volume is not huge, so I'll call this wontfix for 58.
Comment on attachment 8943459 [details] [diff] [review]
Part 1: Use nsLiteralCString to wrap histogram names

Review of attachment 8943459 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the help Eric.
This looks good to me.
Attachment #8943459 - Flags: review?(gfritzsche) → review+
Comment on attachment 8943460 [details] [diff] [review]
Part 2: Add an assertion to check the histogram name pointer

Review of attachment 8943460 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!
Attachment #8943460 - Flags: review?(gfritzsche) → review+
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #10)
> I guess that's not super helpful but there are some improvements that can be
> made:
> 
> 1) Split up that ginormous array

Do you think this contiguous array size is problematic?
We end up using `ptr-size * 17170` and `ptr-size * 8585` respectively.
I can file a follow-up for breaking this down if that will avoid issues.
Flags: needinfo?(erahm)
(In reply to Georg Fritzsche [:gfritzsche] from comment #17)
> (In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment
> #10)
> > I guess that's not super helpful but there are some improvements that can be
> > made:
> > 
> > 1) Split up that ginormous array
> 
> Do you think this contiguous array size is problematic?
> We end up using `ptr-size * 17170` and `ptr-size * 8585` respectively.
> I can file a follow-up for breaking this down if that will avoid issues.

Its probably worth trying as a followup, there were a few crashes:

bp-d97b0c06-88ad-4a3c-8d4a-8387a0180115
bp-f2ba8008-665f-466b-8689-683fc0180115
Flags: needinfo?(erahm)
Needs rebasing.
Flags: needinfo?(erahm)
Keywords: checkin-needed
Pushed by erahm@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/13493806c047
Part 1: Use nsLiteralCString to wrap histogram names. r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/6931047d462f
Part 2: Add an assertion to check the histogram name pointer. r=gfritzsche
Flags: needinfo?(erahm)
https://hg.mozilla.org/mozilla-central/rev/13493806c047
https://hg.mozilla.org/mozilla-central/rev/6931047d462f
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
Please request Beta approval on this when you get a chance.
Flags: needinfo?(erahm)
Comment on attachment 8943459 [details] [diff] [review]
Part 1: Use nsLiteralCString to wrap histogram names

Approval Request Comment
[Feature/Bug causing the regression]:
Possible memory corruption, incorrect wrapping of static strings.
[User impact if declined]:
Slightly higher memory usage, random crashes.
[Is this code covered by automated tests?]:
Yes.
[Has the fix been verified in Nightly?]:
Yes, been on nightly for a few days.
[Needs manual test from QE? If yes, steps to reproduce]:
No. 
[List of other uplifts needed for the feature/fix]:
Part 2 of this patch.
[Is the change risky?]:
Low risk.
[Why is the change risky/not risky?]:
This just properly wraps static strings and the next patch introduces a diagnostic assert that will help track down crashes.
[String changes made/needed]:
N/A
Flags: needinfo?(erahm)
Attachment #8943459 - Flags: approval-mozilla-beta?
Comment on attachment 8943460 [details] [diff] [review]
Part 2: Add an assertion to check the histogram name pointer

(See comment 23)
Attachment #8943460 - Flags: approval-mozilla-beta?
Comment on attachment 8943459 [details] [diff] [review]
Part 1: Use nsLiteralCString to wrap histogram names

Since this might help us avoid or diagnose some OOM crashes, let's uplift to beta. This should be in 59 beta 6 by the end of the week.
Attachment #8943459 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8943460 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: