Closed Bug 1264820 Opened 4 years ago Closed 4 years ago

Measure IPC reply size in telemetry

Categories

(Core :: IPC, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla49
Tracking Status
e10s ? ---
firefox46 --- wontfix
firefox47 --- fixed
firefox48 --- fixed
firefox49 --- fixed

People

(Reporter: billm, Assigned: billm)

References

Details

(Whiteboard: btpp-active)

Attachments

(1 file)

I forgot to measure the size of the reply in bug 1260908.
Attached patch patchSplinter Review
I'll change this to capacity before landing. It's not in my tree right now.
Attachment #8741572 - Flags: review?(continuation)
Comment on attachment 8741572 [details] [diff] [review]
patch

Review of attachment 8741572 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Histograms.json
@@ +10528,5 @@
> +    "kind": "exponential",
> +    "high": 8000000,
> +    "n_buckets": 50,
> +    "keyed": true,
> +    "description": "Measures the size of IPC messages by message name"

"IPC reply messages" maybe?
Attachment #8741572 - Flags: review?(continuation) → review+
Whiteboard: btpp-active
Blocks: 1262918
tracking-e10s: --- → ?
Comment on attachment 8741572 [details] [diff] [review]
patch

Review of attachment 8741572 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/glue/MessageChannel.cpp
@@ +1243,5 @@
>      MOZ_RELEASE_ASSERT(reply->is_sync());
>  
>      *aReply = Move(*reply);
> +    if (aReply->size() >= kMinTelemetryMessageSize) {
> +        Telemetry::Accumulate(Telemetry::IPC_REPLY_SIZE, nsCString(aMsg->name()), aReply->size());

Can you make this nsDependentCString so that we're not allocating unnecessarily?  Thanks!
Backed out:
 https://hg.mozilla.org/integration/mozilla-inbound/rev/051353392264
...for Werror build bustage:
> ipc/glue/MessageChannel.cpp:1247:27: error: comparison between signed
>   and unsigned integer expressions [-Werror=sign-compare]

Here's a TreeHerder cycle showing the bustage:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=ae89c24b76920d2203353580091f4a9240b148af
Flags: needinfo?(wmccloskey)
s/test failures/ASAN heap-use-after-free, in Linux64-ASAN e10s crashtest, jsreftest, & build/

(Not sure how that platform got past the build warning, actually -- maybe we build without warnings as errors in our ASAN builds for some reason.)

Anyway: probably worth doing a try run on that platform to ensure the ASAN issue is fixed before re-landing.
I think the Werror problem is because "if (aReply->size() >= kMinTelemetryMessageSize) {" should use ->capacity() instead. I assume it returns a different type. The UAF is because the telemetry is using aMsg->name() instead of aReply->name(). This is an argument for bug 1267326. Or at least, that would have prevented the UAF. I'll land the fix once I have tested a bit locally.
Here goes nothing.
Flags: needinfo?(wmccloskey)
This should be low risk, so once it is on Nightly and I check that it is actually working with telemetry, it should get backported to Beta.
https://hg.mozilla.org/mozilla-central/rev/8ecea0f57fd0
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Target Milestone: mozilla48 → mozilla49
One message that will be interesting to look at is the reply for ReadPrefsArray(). I found a crash where the message was being grown to 220MB and it results in an OOM crash: https://crash-stats.mozilla.com/report/index/01246d16-3c71-4bb1-a148-35b992160428
Telemetry data is starting to come in, but the message name is mostly "???". Not always, though! I can reproduce that locally. I'll look into why that is. It probably isn't the fault of this patch per se but it still makes the telemetry not useful.
See Also: → 1268938
Blocks: 1268938
See Also: 1268938
We'll start to get names soon, but telemetry suggests this isn't a major source of large messages. There are around 22 thousand messages around 400kb or larger, compared to around a few million messages that size that are PBrowser::Msg_AsyncMessage in the same period.
Comment on attachment 8741572 [details] [diff] [review]
patch

Approval Request Comment
[Feature/regressing bug #]: e10s
[User impact if declined]: This collects data about one kind of IPC message, to help investigate e10s OOMs.
[Describe test coverage new/current, TreeHerder]: This runs frequently.
[Risks and why]: low, it just adds new telemetry
[String/UUID change made/needed]: none
Attachment #8741572 - Flags: approval-mozilla-beta?
Attachment #8741572 - Flags: approval-mozilla-aurora?
Comment on attachment 8741572 [details] [diff] [review]
patch

Telemetry data to investigate e10s OOMs, plus data was verified on Nightly, Beta47+, Aurora48+
Attachment #8741572 - Flags: approval-mozilla-beta?
Attachment #8741572 - Flags: approval-mozilla-beta+
Attachment #8741572 - Flags: approval-mozilla-aurora?
Attachment #8741572 - Flags: approval-mozilla-aurora+
One unfortunate thing that Bill and I noticed is that we don't have any telemetry for the reply to PContent::Msg_ReadPrefsArray, which as we know from bug 1268662 can sometimes be large enough to cause OOM crashes. In a local build, this is because TelemetryImpl::IsInitialized() is returning false, presumably because we set up prefs very early in the startup process. On the plus side, this seems to be the only message, reply or otherwise, that we don't record telemetry for, at least that I can see locally in a clean profile.
You need to log in before you can comment on or make changes to this bug.