Crash in mozilla::ipc::ProcessLink::SendMessage | IPC_Message_Name=PContent::Msg_AccumulateChildHistograms

RESOLVED FIXED in Firefox 56

Status

()

defect
P1
critical
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: calixte, Assigned: chutten)

Tracking

({crash})

Trunk
mozilla56
Unspecified
Linux
Points:
---

Firefox Tracking Flags

(firefox-esr52 unaffected, firefox54 wontfix, firefox55 wontfix, firefox56 fixed)

Details

(Whiteboard: [measurement:client], crash signature)

Attachments

(1 attachment)

(Reporter)

Description

2 years ago
This bug was filed from the Socorro interface and is 
report bp-5a2cf234-be9a-48e6-b2b1-8ab420170531.
=============================================================

I got this crash in trying to load https://people-mozilla.org/~cdenizet/callgraph.json.
The backtrace from gdb is the following:
Thread 1 "Web Content" received signal SIGSEGV, Segmentation fault.
mozilla::ipc::ProcessLink::SendMessage (this=<optimized out>, msg=0x7fbd9d831c80)
    at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/glue/MessageLink.cpp:149
149	      MOZ_CRASH("IPC message size is too large");
(gdb) bt
#0  mozilla::ipc::ProcessLink::SendMessage (this=<optimized out>, msg=0x7fbd9d831c80)
    at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/glue/MessageLink.cpp:149
#1  0x00007fc0723fa5d9 in mozilla::ipc::MessageChannel::Send (this=0x7fc07f268120, aMsg=aMsg@entry=0x7fbd9d831c80)
    at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/glue/MessageChannel.cpp:936
#2  0x00007fc072698dfa in mozilla::dom::PContentChild::SendAccumulateChildHistograms (this=0x7fc07f268020, accumulations=...)
    at /home/calixte/dev/mozilla/mozilla-central.hg/obj-x86_64-pc-linux-gnu/ipc/ipdl/PContentChild.cpp:4823
#3  0x00007fc074280040 in SendAccumulatedData<mozilla::dom::ContentChild> (ipcActor=0x7fc07f268020)
    at /home/calixte/dev/mozilla/mozilla-central.hg/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:264
#4  mozilla::TelemetryIPCAccumulator::IPCTimerFired (aTimer=aTimer@entry=0x0, aClosure=aClosure@entry=0x0)
    at /home/calixte/dev/mozilla/mozilla-central.hg/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:297
#5  0x00007fc074280261 in (anonymous namespace)::<lambda()>::operator() (__closure=<optimized out>)
    at /home/calixte/dev/mozilla/mozilla-central.hg/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:122
#6  mozilla::detail::RunnableFunction<(anonymous namespace)::DispatchIPCTimerFired()::<lambda()> >::Run(void) (this=<optimized out>)
    at /home/calixte/dev/mozilla/mozilla-central.hg/xpcom/threads/nsThreadUtils.h:460
#7  0x00007fc07207710e in mozilla::SchedulerGroup::Runnable::Run (this=0x7fbdb4ee3b50)
    at /home/calixte/dev/mozilla/mozilla-central.hg/xpcom/threads/SchedulerGroup.cpp:359
#8  0x00007fc072084531 in nsThread::ProcessNextEvent (this=0x7fc07f2d8ba0, aMayWait=<optimized out>, aResult=0x7ffc4a79882f)
    at /home/calixte/dev/mozilla/mozilla-central.hg/xpcom/threads/nsThread.cpp:1322
#9  0x00007fc072082835 in NS_ProcessNextEvent (aThread=<optimized out>, aThread@entry=0x7fc07f2d8ba0, aMayWait=aMayWait@entry=false)
    at /home/calixte/dev/mozilla/mozilla-central.hg/xpcom/threads/nsThreadUtils.cpp:472
#10 0x00007fc0723f2e06 in mozilla::ipc::MessagePump::Run (this=0x7fc07f2cb920, aDelegate=0x7ffc4a798a30)
    at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/glue/MessagePump.cpp:96
#11 0x00007fc0723b66e9 in MessageLoop::RunHandler (this=<optimized out>)
    at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/chromium/src/base/message_loop.cc:231
#12 MessageLoop::Run (this=<optimized out>) at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/chromium/src/base/message_loop.cc:211
#13 0x00007fc07361d233 in nsBaseAppShell::Run (this=0x7fc06b9f3f40)
    at /home/calixte/dev/mozilla/mozilla-central.hg/widget/nsBaseAppShell.cpp:156
#14 0x00007fc0742cdfcb in XRE_RunAppShell () at /home/calixte/dev/mozilla/mozilla-central.hg/toolkit/xre/nsEmbedFunctions.cpp:893
#15 0x00007fc0723b66e9 in MessageLoop::RunHandler (this=0x7ffc4a798a30)
    at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/chromium/src/base/message_loop.cc:231
#16 MessageLoop::Run (this=this@entry=0x7ffc4a798a30)
    at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/chromium/src/base/message_loop.cc:211
#17 0x00007fc0742ce39f in XRE_InitChildProcess (aArgc=13, aArgv=0x7ffc4a798d68, aChildData=<optimized out>)
    at /home/calixte/dev/mozilla/mozilla-central.hg/toolkit/xre/nsEmbedFunctions.cpp:709
#18 0x00005590ad2aefd9 in content_process_main (bootstrap=0x7fc07f2ba0a0, argc=16, argv=0x7ffc4a798d68)
    at /home/calixte/dev/mozilla/mozilla-central.hg/browser/app/../../ipc/contentproc/plugin-container.cpp:64
#19 0x00005590ad2aeb1f in main (argc=16, argv=0x7ffc4a798d68, envp=0x7ffc4a798df0)
    at /home/calixte/dev/mozilla/mozilla-central.hg/browser/app/nsBrowserApp.cpp:285
(gdb) p msg->header_->payload_size
$2 = 1075819928
I get a similar signature after loading https://people-mozilla.org/~cdenizet/callgraph.json.

https://crash-stats.mozilla.com/report/index/a9ef6517-e702-4071-9b68-af1a90170531
https://crash-stats.mozilla.com/report/index/be90d471-0113-4156-b7e2-852c90170531
Crash Signature: [@ mozilla::ipc::ProcessLink::SendMessage | IPC_Message_Name=PContent::Msg_AccumulateChildHistograms] → [@ mozilla::ipc::ProcessLink::SendMessage | IPC_Message_Name=PContent::Msg_AccumulateChildHistograms][@ mozilla::ipc::ProcessLink::SendMessageW | IPC_Message_Name=PContent::Msg_AccumulateChildHistograms ]
Hi George, it seems the crash resulting from a telemetry ipc message size is too large. Is there anything we can do in telemetry ipc code?
Flags: needinfo?(gfritzsche)
Chris, can you take a look?
Flags: needinfo?(gfritzsche) → needinfo?(chutten)
(Assignee)

Comment 4

2 years ago
What is "too large"? The telemetry IPC code has some watermarks[1][2] to try and avoid egregiously-sized IPC messages that _ought_ to prevent such things (unless they're the wrong size, or otherwise ineffective... or if the main thread is too wigged out to serve the request in a timely fashion or the parent process can't receive it for whatever reason)

[1]: http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#46-47,50
[2]: http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#141,156,172,214
Flags: needinfo?(chutten) → needinfo?(htsai)
(In reply to Chris H-C :chutten from comment #4)
> What is "too large"? The telemetry IPC code has some watermarks[1][2] to try
> and avoid egregiously-sized IPC messages that _ought_ to prevent such things
> (unless they're the wrong size, or otherwise ineffective... or if the main
> thread is too wigged out to serve the request in a timely fashion or the
> parent process can't receive it for whatever reason)
> 
> [1]:
> http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/ipc/
> TelemetryIPCAccumulator.cpp#46-47,50
> [2]:
> http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/ipc/
> TelemetryIPCAccumulator.cpp#141,156,172,214

Hi Chris, I was guessing the "too large" problem as I saw |MOZ_CRASH("IPC message size is too large");| on comment 0. If it shouldn't be a size issue, do you have ideas of what's up in this crash signature or help me point to the right person? Thank you.
Flags: needinfo?(htsai)
Flags: needinfo?(chutten)
(Reporter)

Comment 6

2 years ago
For information, I added the payload size at the end of the backtrace:
(gdb) p msg->header_->payload_size
$2 = 1075819928
(Assignee)

Comment 7

2 years ago
A gig? Oh wow, that's impressive. How could that have happened...

I find it unlikely that it is full of legitimately-accumulated telemetry data. Each accumulation is two uint32_t...

Oh, but a KeyedAccumulation has a nsCString key. That could be a problem, since I don't see any code limiting those keys' lengths. That could explain some of the OOMs we occasionally see in Telemetry IPC code due to outrageously-large allocations, too...

Georg, is there supposed to be a length limit for keyed{Histograms,Scalars,etc}' keys? 

Irrespective of this, a sensible fix for this might be to rewrite the watermarks in terms of the size of the structs we're sending.
Flags: needinfo?(chutten) → needinfo?(gfritzsche)
This would be bug 1275035.
Possibly the right fix here is to land that with a conservative value (+ some asserts and error telemetry?).
Flags: needinfo?(gfritzsche)
(In reply to Calixte Denizet (:calixte) from comment #0)
> #2  0x00007fc072698dfa in
> mozilla::dom::PContentChild::SendAccumulateChildHistograms
> (this=0x7fc07f268020, accumulations=...)
>     at
> /home/calixte/dev/mozilla/mozilla-central.hg/obj-x86_64-pc-linux-gnu/ipc/
> ipdl/PContentChild.cpp:4823

But note that this is not in SendAccumulateChildKeyedHistograms.
(Assignee)

Comment 10

2 years ago
Then I am baffled. struct Accumulation is exactly two uint32_t. payload_size is 1075819928. Assuming payload_size is in bytes, that's over 134 Million accumulations. That's over 67k per millisecond (2s timer _ought_ to preclude these accumulations from being over 2000ms old).

That seems rather impossible.
(Assignee)

Updated

2 years ago
Assignee: nobody → chutten
Status: NEW → ASSIGNED
Component: DOM → Telemetry
Product: Core → Toolkit
Whiteboard: [measurement:client]
Priority: -- → P1
Per talking through it this week, running up accumulation arrays of that size is actually possible (recording into Telemetry is faster than assumed).
The problem is presumably that some scenarios accumulate Telemetry too fast while we wait for the main thread to clear out & send the accumulations.

The next steps here are:
- truncating the accumulations when going over a limit
- adding metrics to track that this happens
Comment hidden (mozreview-request)
(Assignee)

Updated

2 years ago
Attachment #8876267 - Flags: feedback?(benjamin)

Comment 13

2 years ago
mozreview-review
Comment on attachment 8876267 [details]
bug 1369041 - Allow child processes to discard data when overwhelmed  f?bsmedberg

https://reviewboard.mozilla.org/r/147708/#review152386

This looks like the right approach, but i'm not sure about:
- the way this sends the accumulations
- the chosen factor and if it is sufficient to protect us from unwanted OOMs

::: toolkit/components/telemetry/Scalars.yaml:478
(Diff revision 1)
>        - telemetry-client-dev@mozilla.com
>      release_channel_collection: opt-out
>      record_in_processes:
>        - 'main'
>  
> +telemetry.discarded:

Ok, i kind of whish we had bug 1343855 for scalars already, that would allow to specify all those properties once.

::: toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:56
(Diff revision 1)
>  // With the current limits, events cost us about 1100 bytes each.
>  // This limits memory use to about 10MB.
>  const size_t kEventsArrayHighWaterMark = 10000;
> +// If we are starved we can overshoot the watermark.
> +// This is the multiplier over which we will discard data.
> +const size_t kWaterMarkDiscardFactor = 100;

Is this low enough?
What max. memory use does this factor amount to for histograms, keyed histograms, scalars, ...?

::: toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:59
(Diff revision 1)
> +uint32_t gDiscardedAccumulations = 0;
> +uint32_t gDiscardedKeyedAccumulations = 0;

Can we name these `gDiscarded{Keyed}HistogramAccumulations` to avoid confusion?

::: toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:299
(Diff revision 1)
> +      scalarsToSend.AppendElement(ScalarAction{
> +        ScalarID::TELEMETRY_DISCARDED_ACCUMULATIONS, ScalarActionType::eAdd,
> +        Some(AsVariant(gDiscardedAccumulations)) });

This seems fragile and duplicates the accumulation details.
Can we add an IPC message for this and record the scalars in the parent?
Attachment #8876267 - Flags: review?(gfritzsche)

Comment 14

2 years ago
mozreview-review
Comment on attachment 8876267 [details]
bug 1369041 - Allow child processes to discard data when overwhelmed  f?bsmedberg

https://reviewboard.mozilla.org/r/147708/#review152486

data-r=me - please do a pi-request if this is something we should add to mission control monitoring (sounds like it should be!)
Attachment #8876267 - Flags: review+

Updated

2 years ago
Attachment #8876267 - Flags: feedback?(benjamin)
(Assignee)

Comment 15

2 years ago
mozreview-review-reply
Comment on attachment 8876267 [details]
bug 1369041 - Allow child processes to discard data when overwhelmed  f?bsmedberg

https://reviewboard.mozilla.org/r/147708/#review152386

> Is this low enough?
> What max. memory use does this factor amount to for histograms, keyed histograms, scalars, ...?

Prior art for this was here: https://bugzilla.mozilla.org/show_bug.cgi?id=1338555#c8

Current watermarks have us at:

40k  for accumulations
280k for keyed accumulations
400k for scalars
880k for keyed scalars
???? for events
----
1.6M + ??? 

There's a 256M limit on IPC messages (also according to bug 1338555)

With a factor of 100, we're looking at 160M + 100 * (size of events watermark). If the events watermark is 1M, we might still hit the limit if every buffer's full at once.

> Can we name these `gDiscarded{Keyed}HistogramAccumulations` to avoid confusion?

{Keyed}Accumulations is the name of the type, so it seemed the most sensible to use.

> This seems fragile and duplicates the accumulation details.
> Can we add an IPC message for this and record the scalars in the parent?

Which part is fragile? The Some(AsVariant(...)) stuff was a bit odd, but everythign is strongly typed. Using the scalar mechanism itself seems an excellent way to ensure we have the flexibility to change our data reporting requirements in the future, if necessary.

I suppose we could add custom IPC for this, but that seems fragile as well, given the number of processes we need to support already (and into the future).

I'd much rather report in-band than out-of-band.
(Assignee)

Comment 16

2 years ago
(In reply to Benjamin Smedberg [:bsmedberg] from comment #14)
> Comment on attachment 8876267 [details]
> bug 1369041 - Allow child processes to discard data when overwhelmed 
> f?bsmedberg
> 
> https://reviewboard.mozilla.org/r/147708/#review152486
> 
> data-r=me - please do a pi-request if this is something we should add to
> mission control monitoring (sounds like it should be!)

I thought Mission Control was more interested in user-impacting criteria, not Telemetry Client Health?
Flags: needinfo?(benjamin)
Mission control is intended to be a system for monitoring all incoming telemetry data in a permanent/reliable way.
Flags: needinfo?(benjamin)
(Assignee)

Comment 18

2 years ago
Then I may just happen to have a list of Telemetry Health criteria that I may wish to add:

- all of these new telementry.discarded scalars
- TELEMETRY_FAILURE_TYPE
- ..actually, probably just about all of the TELEMETRY_* histograms that aren't TELEMETRY_TEST_*
- ping latency
- ping duplicates
- missing subsessions (discontinuous info.profileSubsessionCounter)

Is there a process for having these added? MC seems pretty early-days as of yet, but presumably we could make a table of "measurement, location(hgram, scalar, existing table, etc.), type(count,threshold,etc.), alerting threshold, alert emails" rows.
NI for comment 18.
Flags: needinfo?(benjamin)
(In reply to Chris H-C :chutten from comment #15)
> There's a 256M limit on IPC messages (also according to bug 1338555)
> 
> With a factor of 100, we're looking at 160M + 100 * (size of events
> watermark). If the events watermark is 1M, we might still hit the limit if
> every buffer's full at once.

The rough estimate here puts us at ~10M upper bound:
https://dxr.mozilla.org/mozilla-central/rev/91134c95d68cbcfe984211fa3cbd28d610361ef1/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#48

This seems problematic.
Do we have a reason to use a higher factor than say 2x or 5x?

> > Can we name these `gDiscarded{Keyed}HistogramAccumulations` to avoid confusion?
> 
> {Keyed}Accumulations is the name of the type, so it seemed the most sensible
> to use.

I see, we could rename those too. This could be a good mentored follow-up bug?

> > This seems fragile and duplicates the accumulation details.
> > Can we add an IPC message for this and record the scalars in the parent?
> 
> Which part is fragile? The Some(AsVariant(...)) stuff was a bit odd, but
> everythign is strongly typed. Using the scalar mechanism itself seems an
> excellent way to ensure we have the flexibility to change our data reporting
> requirements in the future, if necessary.
> 
> I suppose we could add custom IPC for this, but that seems fragile as well,
> given the number of processes we need to support already (and into the
> future).

This leaks out / duplicates the scalar IPC serialization, i don't think we should do that.
One specific concern is that we change semantics on this without changing the types, then overlook changing this (as its hard to discover).

We could:
1) properly share that serialization code or
2) send up this information in a separate message or
3) add test coverage that assures that this doesn't break

2) doesn't seem too bad either, it would involve:
- invoking the ipc message for this on whatever IPC process actor we have (similar to [1])
- routing this through TelemetryIPC.h/.cpp

The first part should assure that we are not forgetting adding this for different supported processes (produces compile error).

1: https://dxr.mozilla.org/mozilla-central/rev/91134c95d68cbcfe984211fa3cbd28d610361ef1/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#231
gfritzsche please email pi-request@mozilla.com and we'll add it to the mission control backlog (expect that to be a couple months).
Flags: needinfo?(benjamin)
Comment hidden (mozreview-request)

Comment 23

2 years ago
mozreview-review
Comment on attachment 8876267 [details]
bug 1369041 - Allow child processes to discard data when overwhelmed  f?bsmedberg

https://reviewboard.mozilla.org/r/147708/#review155148

Thanks.

::: toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:22
(Diff revision 2)
> +using mozilla::AsVariant;
> +using mozilla::Some;

Unused?

::: toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:34
(Diff revision 2)
>  using mozilla::Telemetry::Accumulation;
> +using mozilla::Telemetry::DiscardedData;
>  using mozilla::Telemetry::KeyedAccumulation;
>  using mozilla::Telemetry::ScalarActionType;
>  using mozilla::Telemetry::ScalarAction;
> +using mozilla::Telemetry::ScalarID;

Unused?

::: toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:60
(Diff revision 2)
> +uint32_t gDiscardedAccumulations = 0;
> +uint32_t gDiscardedKeyedAccumulations = 0;
> +uint32_t gDiscardedScalarActions = 0;
> +uint32_t gDiscardedKeyedScalarActions = 0;
> +uint32_t gDiscardedChildEvents = 0;

Can we track this in a `DiscardedData`?
Then zero-initialization & reset are trivial (c++ value initialization semantics).

::: toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:303
(Diff revision 2)
> +    discardedData = {
> +      gDiscardedAccumulations,
> +      gDiscardedKeyedAccumulations,
> +      gDiscardedScalarActions,
> +      gDiscardedKeyedScalarActions,
> +      gDiscardedChildEvents};

Nit: The closing bracket should be on a new line.
Attachment #8876267 - Flags: review?(gfritzsche) → review+
Comment hidden (mozreview-request)
(Assignee)

Updated

2 years ago
Blocks: 1375043
Comment hidden (mozreview-request)

Comment 26

2 years ago
Pushed by chutten@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/fde21799bb80
Allow child processes to discard data when overwhelmed r=bsmedberg,gfritzsche f?bsmedberg

Comment 27

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/fde21799bb80
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Does this need to be considered for uplift or can it ride the 56 train?
Flags: needinfo?(chutten)
(Assignee)

Comment 29

2 years ago
As mentioned over email, this is a very unlikely crash in a situation we already only tolerably support (megs-and-megs of JSON) so there's no great rush.
Flags: needinfo?(chutten)
Duplicate of this bug: 1344008
You need to log in before you can comment on or make changes to this bug.