Closed Bug 1260908 Opened 7 years ago Closed 7 years ago

Record message sizes in telemetry


(Core :: IPC, defect, P1)




Tracking Status
e10s m9+ ---
firefox46 --- wontfix
firefox47 --- fixed
firefox48 --- fixed


(Reporter: billm, Assigned: billm)




(4 files, 2 obsolete files)

The goal here is to find any IPC messages that are really big. Maybe we could break them into smaller pieces so that we don't need contiguous memory for them.
I'm trying to call Accumulate for a keyed histogram on a path that sometimes runs before telemetry has started up. Right now this asserts because the keyed histograms haven't been initialized. This patch just allows us to skip accumulating in that case.
Attachment #8736494 - Flags: review?(gfritzsche)
This patch records the size of normal IPC messages and message manager messages in keyed histograms. I set a lower bound of 8K to avoid recording too much data.
Attachment #8736496 - Flags: review?(dvander)
Comment on attachment 8736494 [details] [diff] [review]
Allow keyed histograms to accumulate early in startup

Review of attachment 8736494 [details] [diff] [review]:

::: toolkit/components/telemetry/Telemetry.cpp
@@ +3902,5 @@
>  void
>  Accumulate(ID aID, const nsCString& aKey, uint32_t aSample)
>  {
>    if (!TelemetryImpl::CanRecordBase()) {
>      return;

I assume this is happening with sTelemetry==nullptr?
If so, let's just add this check here instead.
Attachment #8736494 - Flags: review?(gfritzsche)
That field is private. Hopefully this is okay.
Attachment #8736494 - Attachment is obsolete: true
Attachment #8736778 - Flags: review?(gfritzsche)
Comment on attachment 8736778 [details] [diff] [review]
Allow keyed histograms to accumulate early in startup, v2

Review of attachment 8736778 [details] [diff] [review]:

Thanks, this looks fine if we can restore the assert.

::: toolkit/components/telemetry/Telemetry.cpp
@@ +3913,5 @@
>      return;
>    }
>    const TelemetryHistogram& th = gHistograms[aID];
>    KeyedHistogram* keyed = TelemetryImpl::GetKeyedHistogramById(nsDependentCString(;
> +  if (keyed) {

Now we should be able to restore the MOZ_ASSERT() as before.
Attachment #8736778 - Flags: review?(gfritzsche) → review+
Attachment #8736496 - Flags: review?(dvander) → review+
Flags: needinfo?(wmccloskey)
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Some data has come in from this now:

Nothing too surprising in there. Something like 0.02% of samples for PHttpChannel::Msg_OnTransportAndData are around 1MB or so. I don't know if that would be enough to kick up our OOM rates.
In case you are looking at the telemetry, it is just showing the first four messages. It isn't in any particular order. For instance, PBrowser::Msg_AsyncMessage has 1.2% of recorded samples between 2.21MB and 3MB...
Comment on attachment 8736496 [details] [diff] [review]
Record message sizes

We'd like to gather this telemetry to understand some OOM crashes we're seeing in the e10s beta experiment.

Approval Request Comment
[Feature/regressing bug #]: no bug
[User impact if declined]: none
[Describe test coverage new/current, TreeHerder]: none
[Risks and why]: very low risk, just telemetry
[String/UUID change made/needed]: none
Attachment #8736496 - Flags: approval-mozilla-beta?
Attachment #8736496 - Flags: approval-mozilla-aurora?
Comment on attachment 8736778 [details] [diff] [review]
Allow keyed histograms to accumulate early in startup, v2

Same as above.
Attachment #8736778 - Flags: approval-mozilla-beta?
Attachment #8736778 - Flags: approval-mozilla-aurora?
Here's the data we have so far:

Here are all the IPC messages where we've recorded data of over 1 MB (with the max size to the right):
 (8000000, u'PStorage::Msg_LoadItem'),
 (8000000, u'PContent::Msg_AsyncMessage'),
 (8000000, u'PBrowser::Msg_AsyncMessage'),
 (8000000, u'PBackgroundIDBTransaction::Msg_PBackgroundIDBRequestConstructor'),
 (8000000, u'PBackgroundIDBRequest::Msg___delete__'),
 (8000000, u'PBackgroundIDBCursor::Msg_Response'),
 (5802597, u'PHttpChannel::Msg_OnTransportAndData'),
 (4208766, u'PStorage::Msg_OriginsHavingData'),
 (4208766, u'PJavaScript::Msg_CallOrConstruct'),
 (3052721, u'PWebSocket::Msg_OnMessageAvailable'),
 (3052721, u'PContent::Msg_InvokeDragSession'),
 (1164888, u'PContent::Msg_LoadAndRegisterSheet')

Here are the message manager messages:
 (8000000, u'sdk/remote/process/message'),
 (8000000, u'SessionStore:update'),
 (8000000, u'SessionStore:restoreHistory'),
 (4208766, u'debug:netmonitor:*.*.*/*:updateEvent'),
 (4208766, u'AdblockPlus:Message'),
 (3052721, u'wombat::core::response'),
 (3052721, u'iescreenshot-download'),
 (3052721, u'greasemonkey:scripts-update'),
 (1164888, u'ublock0:sb:2236'),
 (1164888, u'ublock0:sb:2235'),
 (1164888, u'resource://kango-unity-extensionkeeebcom/message'),
 (1164888, u'imagus:sb:398'),
 (1164888, u'imagus:sb:175'),
 (1164888, u'imagus:sb:127')

I had to collapse together a bunch of debug:netmonitor messages.
Andrew, would you mind filing bugs to investigate whether we can improve some of these? Aside from session store, I'm not too familiar with any of this code.
Flags: needinfo?(continuation)
Actually, this data is for the parent process only. I'll work on child process data.
Sure, I can look over them and figure out which ones are more frequent. I've already filed one for SessionStorage:update (bug 1262661). I think it makes sense to take into account how many of these large messages of each size there are when deciding what to fix. Some of the messages are around 1000 times more common than other ones.
Flags: needinfo?(continuation)
Comment on attachment 8736496 [details] [diff] [review]
Record message sizes

Gather more telemetry data to investigate e10s OOM crashes.
Attachment #8736496 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8736778 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8736496 [details] [diff] [review]
Record message sizes

Too late for the 46 e10s experiment.
Attachment #8736496 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Attachment #8736778 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
I'm working on a fix. I have no idea why this works on trunk and fails on Aurora, but the fix looks easy enough.
tracking-e10s: --- → ?
Might as well fix this on trunk, too.

Olli, this is a tiny patch. Anybody could basically review it.

StructuredCloneData::DataLength() has type size_t, so change the constant to also be a constant. The constant in MessageChannel can stay as an int, because it is compared to Pickle::size(), which returns an int for some reason.
Attachment #8741360 - Flags: review?(bugs)
> so change the constant to also be a constant.
This should be "so change the constant to also be a size_t".
One drawback of this patch that I noticed is that it measures the size() of the data in the message, not the capacity(), so it isn't measuring the full memory impact of these messages, and will miss improvements from bugs like bug 1263235 and bug 1263953. I'll file a followup for that.
Comment on attachment 8741360 [details] [diff] [review]
Fix type of kMinTelemetryMessageSize.

I guess I can just fix this as part of changing to capacity().
Attachment #8741360 - Flags: review?(bugs)
Comment on attachment 8741360 [details] [diff] [review]
Fix type of kMinTelemetryMessageSize.

Never mind, that's separate. Sorry for the noise!
Attachment #8741360 - Flags: review?(bugs)
Attachment #8741360 - Flags: review?(bugs) → review+
Capacity includes internal fragmentation, while size does not.

This requires making capacity() public, but that seems benign.
Attachment #8741386 - Flags: review?(wmccloskey)
Comment on attachment 8741386 [details] [diff] [review]
Record IPC message capacity instead of size.

Wrong bug, sorry.
Attachment #8741386 - Attachment is obsolete: true
Attachment #8741386 - Flags: review?(wmccloskey)
Ryan, could you land this on Aurora, with my additional patch folded into Bill's second patch? I'm sure it will fix the issue. Thanks.
Flags: needinfo?(ryanvm)
Priority: -- → P1
Flags: needinfo?(ryanvm)
I think we should stop recording MESSAGE_MANAGER_MESSAGE_SIZE on beta so we don't spam the telemetry when it goes to release (bug 1274442). This telemetry has already been disabled, and just stopping to send this seems lower risk than trying to fix it in place via bug 1275707. We're late enough in beta that there's not too much value to be gotten from collecting this telemetry.
Attachment #8756942 - Flags: review?(wmccloskey)
Attachment #8756942 - Flags: review?(wmccloskey) → review+
Comment on attachment 8756942 [details] [diff] [review]

Thanks for the fast review.

Approval Request Comment
[Feature/regressing bug #]: bug 1260908
[User impact if declined]: This is causing some issues with telemetry, so I want to just remove it before we release.
[Describe test coverage new/current, TreeHerder]: this code runs very frequently
[Risks and why]: Very low. This just stops collecting one kind of telemetry by backing out part of a patch. I'm planning to fix this issue outside of beta in bug 1275707, but disabling the telemetry is lower risk than changing slightly how it is collected.
[String/UUID change made/needed]: none
Attachment #8756942 - Flags: approval-mozilla-beta?
Comment on attachment 8756942 [details] [diff] [review]

At this point in time, only fixes that might lead to a dot release (like critical regressions, stability or security issues) are being considered for uplift. 47.0b9 build is already in progress, RC1 gtb is Monday. I would prefer to not uplift this fix but let it ride the 48 train. Andrew is ok with that.
Attachment #8756942 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
My impression is that this is only causing issues for telemetry in the website frontend side of things, and they've worked around that already so it isn't a huge deal to just leave this alone.
There was actually a patch landed in 47, it is just a partial backout that won't be landed there.
You need to log in before you can comment on or make changes to this bug.