All users were logged out of Bugzilla on October 13th, 2018

Record message sizes in telemetry

RESOLVED FIXED in Firefox 47

Status

()

P1
normal
RESOLVED FIXED
3 years ago
2 years ago

People

(Reporter: billm, Assigned: billm)

Tracking

unspecified
mozilla48
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(e10sm9+, firefox46 wontfix, firefox47 fixed, firefox48 fixed)

Details

Attachments

(4 attachments, 2 obsolete attachments)

The goal here is to find any IPC messages that are really big. Maybe we could break them into smaller pieces so that we don't need contiguous memory for them.
Created attachment 8736494 [details] [diff] [review]
Allow keyed histograms to accumulate early in startup

I'm trying to call Accumulate for a keyed histogram on a path that sometimes runs before telemetry has started up. Right now this asserts because the keyed histograms haven't been initialized. This patch just allows us to skip accumulating in that case.
Attachment #8736494 - Flags: review?(gfritzsche)
Created attachment 8736496 [details] [diff] [review]
Record message sizes

This patch records the size of normal IPC messages and message manager messages in keyed histograms. I set a lower bound of 8K to avoid recording too much data.
Attachment #8736496 - Flags: review?(dvander)
Comment on attachment 8736494 [details] [diff] [review]
Allow keyed histograms to accumulate early in startup

Review of attachment 8736494 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Telemetry.cpp
@@ +3902,5 @@
>  void
>  Accumulate(ID aID, const nsCString& aKey, uint32_t aSample)
>  {
>    if (!TelemetryImpl::CanRecordBase()) {
>      return;

I assume this is happening with sTelemetry==nullptr?
If so, let's just add this check here instead.
Attachment #8736494 - Flags: review?(gfritzsche)
Created attachment 8736778 [details] [diff] [review]
Allow keyed histograms to accumulate early in startup, v2

That field is private. Hopefully this is okay.
Attachment #8736494 - Attachment is obsolete: true
Attachment #8736778 - Flags: review?(gfritzsche)
Comment on attachment 8736778 [details] [diff] [review]
Allow keyed histograms to accumulate early in startup, v2

Review of attachment 8736778 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, this looks fine if we can restore the assert.

::: toolkit/components/telemetry/Telemetry.cpp
@@ +3913,5 @@
>      return;
>    }
>    const TelemetryHistogram& th = gHistograms[aID];
>    KeyedHistogram* keyed = TelemetryImpl::GetKeyedHistogramById(nsDependentCString(th.id()));
> +  if (keyed) {

Now we should be able to restore the MOZ_ASSERT() as before.
Attachment #8736778 - Flags: review?(gfritzsche) → review+
Attachment #8736496 - Flags: review?(dvander) → review+
Flags: needinfo?(wmccloskey)

Comment 9

3 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/8d4ce12bda44
https://hg.mozilla.org/mozilla-central/rev/c8eea76ef144
Status: NEW → RESOLVED
Last Resolved: 3 years ago
status-firefox48: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Some data has come in from this now: http://mzl.la/1XfxYN1

Nothing too surprising in there. Something like 0.02% of samples for PHttpChannel::Msg_OnTransportAndData are around 1MB or so. I don't know if that would be enough to kick up our OOM rates.
In case you are looking at the telemetry, it is just showing the first four messages. It isn't in any particular order. For instance, PBrowser::Msg_AsyncMessage has 1.2% of recorded samples between 2.21MB and 3MB...
Comment on attachment 8736496 [details] [diff] [review]
Record message sizes

We'd like to gather this telemetry to understand some OOM crashes we're seeing in the e10s beta experiment.

Approval Request Comment
[Feature/regressing bug #]: no bug
[User impact if declined]: none
[Describe test coverage new/current, TreeHerder]: none
[Risks and why]: very low risk, just telemetry
[String/UUID change made/needed]: none
Attachment #8736496 - Flags: approval-mozilla-beta?
Attachment #8736496 - Flags: approval-mozilla-aurora?
Comment on attachment 8736778 [details] [diff] [review]
Allow keyed histograms to accumulate early in startup, v2

Same as above.
Attachment #8736778 - Flags: approval-mozilla-beta?
Attachment #8736778 - Flags: approval-mozilla-aurora?
Here's the data we have so far:
https://gist.github.com/bill-mccloskey/65b1115556dbbacdf020f749e4b3dd98

Here are all the IPC messages where we've recorded data of over 1 MB (with the max size to the right):
 (8000000, u'PStorage::Msg_LoadItem'),
 (8000000, u'PContent::Msg_AsyncMessage'),
 (8000000, u'PBrowser::Msg_AsyncMessage'),
 (8000000, u'PBackgroundIDBTransaction::Msg_PBackgroundIDBRequestConstructor'),
 (8000000, u'PBackgroundIDBRequest::Msg___delete__'),
 (8000000, u'PBackgroundIDBCursor::Msg_Response'),
 (5802597, u'PHttpChannel::Msg_OnTransportAndData'),
 (4208766, u'PStorage::Msg_OriginsHavingData'),
 (4208766, u'PJavaScript::Msg_CallOrConstruct'),
 (3052721, u'PWebSocket::Msg_OnMessageAvailable'),
 (3052721, u'PContent::Msg_InvokeDragSession'),
 (1164888, u'PContent::Msg_LoadAndRegisterSheet')

Here are the message manager messages:
 (8000000, u'sdk/remote/process/message'),
 (8000000, u'SessionStore:update'),
 (8000000, u'SessionStore:restoreHistory'),
 (4208766, u'debug:netmonitor:*.*.*/*:updateEvent'),
 (4208766, u'AdblockPlus:Message'),
 (3052721, u'wombat::core::response'),
 (3052721, u'iescreenshot-download'),
 (3052721, u'greasemonkey:scripts-update'),
 (1164888, u'ublock0:sb:2236'),
 (1164888, u'ublock0:sb:2235'),
 (1164888, u'resource://kango-unity-extensionkeeebcom/message'),
 (1164888, u'imagus:sb:398'),
 (1164888, u'imagus:sb:175'),
 (1164888, u'imagus:sb:127')

I had to collapse together a bunch of debug:netmonitor messages.
Andrew, would you mind filing bugs to investigate whether we can improve some of these? Aside from session store, I'm not too familiar with any of this code.
Flags: needinfo?(continuation)
Actually, this data is for the parent process only. I'll work on child process data.
Sure, I can look over them and figure out which ones are more frequent. I've already filed one for SessionStorage:update (bug 1262661). I think it makes sense to take into account how many of these large messages of each size there are when deciding what to fix. Some of the messages are around 1000 times more common than other ones.
Flags: needinfo?(continuation)

Updated

3 years ago
status-firefox46: --- → affected
status-firefox47: --- → affected
Comment on attachment 8736496 [details] [diff] [review]
Record message sizes

Gather more telemetry data to investigate e10s OOM crashes.
Attachment #8736496 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Updated

3 years ago
Attachment #8736778 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8736496 [details] [diff] [review]
Record message sizes

Too late for the 46 e10s experiment.
Attachment #8736496 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Attachment #8736778 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
I'm working on a fix. I have no idea why this works on trunk and fails on Aurora, but the fix looks easy enough.
tracking-e10s: --- → ?
Created attachment 8741360 [details] [diff] [review]
Fix type of kMinTelemetryMessageSize.

Might as well fix this on trunk, too.

Olli, this is a tiny patch. Anybody could basically review it.

StructuredCloneData::DataLength() has type size_t, so change the constant to also be a constant. The constant in MessageChannel can stay as an int, because it is compared to Pickle::size(), which returns an int for some reason.
Attachment #8741360 - Flags: review?(bugs)
> so change the constant to also be a constant.
This should be "so change the constant to also be a size_t".
status-firefox46: affected → wontfix
One drawback of this patch that I noticed is that it measures the size() of the data in the message, not the capacity(), so it isn't measuring the full memory impact of these messages, and will miss improvements from bugs like bug 1263235 and bug 1263953. I'll file a followup for that.
Comment on attachment 8741360 [details] [diff] [review]
Fix type of kMinTelemetryMessageSize.

I guess I can just fix this as part of changing to capacity().
Attachment #8741360 - Flags: review?(bugs)
Comment on attachment 8741360 [details] [diff] [review]
Fix type of kMinTelemetryMessageSize.

Never mind, that's separate. Sorry for the noise!
Attachment #8741360 - Flags: review?(bugs)

Updated

3 years ago
Attachment #8741360 - Flags: review?(bugs) → review+
Blocks: 1264662
Created attachment 8741386 [details] [diff] [review]
Record IPC message capacity instead of size.

Capacity includes internal fragmentation, while size does not.

This requires making capacity() public, but that seems benign.
Attachment #8741386 - Flags: review?(wmccloskey)
Comment on attachment 8741386 [details] [diff] [review]
Record IPC message capacity instead of size.

Wrong bug, sorry.
Attachment #8741386 - Attachment is obsolete: true
Attachment #8741386 - Flags: review?(wmccloskey)
Ryan, could you land this on Aurora, with my additional patch folded into Bill's second patch? I'm sure it will fix the issue. Thanks.
Flags: needinfo?(ryanvm)

Updated

3 years ago
tracking-e10s: ? → m9+
Priority: -- → P1
Flags: needinfo?(ryanvm)
Created attachment 8756942 [details] [diff] [review]
Don't record MESSAGE_MANAGER_MESSAGE_SIZE on beta.

I think we should stop recording MESSAGE_MANAGER_MESSAGE_SIZE on beta so we don't spam the telemetry when it goes to release (bug 1274442). This telemetry has already been disabled, and just stopping to send this seems lower risk than trying to fix it in place via bug 1275707. We're late enough in beta that there's not too much value to be gotten from collecting this telemetry.
Attachment #8756942 - Flags: review?(wmccloskey)
Attachment #8756942 - Flags: review?(wmccloskey) → review+
Comment on attachment 8756942 [details] [diff] [review]
Don't record MESSAGE_MANAGER_MESSAGE_SIZE on beta.

Thanks for the fast review.

Approval Request Comment
[Feature/regressing bug #]: bug 1260908
[User impact if declined]: This is causing some issues with telemetry, so I want to just remove it before we release.
[Describe test coverage new/current, TreeHerder]: this code runs very frequently
[Risks and why]: Very low. This just stops collecting one kind of telemetry by backing out part of a patch. I'm planning to fix this issue outside of beta in bug 1275707, but disabling the telemetry is lower risk than changing slightly how it is collected.
[String/UUID change made/needed]: none
Attachment #8756942 - Flags: approval-mozilla-beta?
Comment on attachment 8756942 [details] [diff] [review]
Don't record MESSAGE_MANAGER_MESSAGE_SIZE on beta.

At this point in time, only fixes that might lead to a dot release (like critical regressions, stability or security issues) are being considered for uplift. 47.0b9 build is already in progress, RC1 gtb is Monday. I would prefer to not uplift this fix but let it ride the 48 train. Andrew is ok with that.
Attachment #8756942 - Flags: approval-mozilla-beta? → approval-mozilla-beta-

Updated

2 years ago
status-firefox47: fixed → wontfix
My impression is that this is only causing issues for telemetry in the website frontend side of things, and they've worked around that already so it isn't a huge deal to just leave this alone.
There was actually a patch landed in 47, it is just a partial backout that won't be landed there.
status-firefox47: wontfix → fixed
You need to log in before you can comment on or make changes to this bug.