Closed Bug 1218576 Opened 9 years ago Closed 8 years ago

Aggregate content process Telemetry in the main process

Categories

(Toolkit :: Telemetry, defect, P3)


Tracking


VERIFIED FIXED
mozilla52
Tracking Status
e10s + ---
firefox44 --- affected
firefox52 --- verified

People

(Reporter: vladan, Assigned: chutten)

References

(Depends on 1 open bug, Blocks 5 open bugs)

Details

(Keywords: feature, Whiteboard: [measurement:client][fce-active-legacy])

Attachments

(19 files, 50 obsolete files)

* 2.42 KB, patch (chutten: review+)
* 3.59 KB, patch (chutten: review+)
* 6.91 KB, patch (chutten: review+)
* 3.51 KB, patch (chutten: review+)
* 7.03 KB, patch (chutten: review+)
* 1.52 KB, patch (chutten: review+)
* 3.63 KB, patch (chutten: review+)
* 1.73 KB, patch (chutten: review+)
* 1.03 KB, patch (gfritzsche: review+)
* 2.33 KB, patch (gfritzsche: review+)
* 19.05 KB, patch (gfritzsche: review+)
* 19.52 KB, patch (chutten: review+)
* 2.65 KB, patch (chutten: review+)
* 3.82 KB, patch (chutten: review+)
* 932 bytes, patch (gfritzsche: review+)
* 4.49 KB, patch (froydnj: review+)
* 2.45 KB, patch (gfritzsche: review+)
* 1.79 KB, patch (gfritzsche: review+)
* 1.70 KB, patch (gfritzsche: review+)
We are not submitting child-process Telemetry in subsession pings. We should collect child-process Telemetry every time a parent-process ping is generated.
The current design is more or less incidental - we always submit child-process Telemetry if we have data available:
https://dxr.mozilla.org/mozilla-central/rev/6c7c983bce46a460c2766fbdd73283f6d2b03a69/toolkit/components/telemetry/TelemetrySession.jsm#1297

However, currently we never request child payloads from the parent process - so what usually happens is:
* Firefox begins to shut down
* content processes shut down and send their Telemetry payloads
* the parent process includes them in the "shutdown"/"saved-session" ping

To make this work properly, we need to do a proper design of:
* how and when child-process payloads are collected (per subsession, when child processes go away, ...)
* when child payloads expire
* how to avoid resubmitting child data in subsequent subsessions
Currently we would just resubmit child telemetry with each subsession, potentially duplicating a lot of data.

Assuming we want to receive the child-process Telemetry with low latency in our pipeline, we probably need to collect that data on subsession splits (similar to requestChildPayloads()), requesting the child's subsession measurements.
Issues that come with that design:
* this makes the ping generation asynchronous as we have to wait for all child processes (only content processes?) to respond with their data
* we'd need to protect ourselves from child processes not answering "soon"
* we'll have to solve ordering issues around shutdown

Are we already settled on collecting child-process Telemetry on each subsession split?
Or are there potential alternatives?
Hey vladan, the e10s team would like to know if this potentially interferes with the accuracy of the data we will collect and use in deciding on rollout. If so, we'll need to block on it.
Flags: needinfo?(vladan.bugzilla)
This does not interfere with the accuracy of the data. We've been using full-session pings up to now and full-session pings do contain e10s child telemetry. 
We aren't using subsession pings for the e10s analyses yet (we're not using subsession pings in most analyses)
Flags: needinfo?(vladan.bugzilla)
thanks.
No longer blocks: e10s-rc
Priority: -- → P4
Whiteboard: [measurement:client]
Assignee: nobody → chutten
From conversation: 

* Any calls that could race between children should just drop and assert: MOZ_ASSERT(!isChildProcess) inside things like NewHistogram, Clear, or basically everything that isn't an Accumulate (see the sketch after this list).
* Consider using a content-process prefix like the SUBSESSION_HISTOGRAM_PREFIX for storing/aggregating on the parent side.
* Keep an eye on the Telemetry refactoring bugs (like bug 1261052 and bug 1261063) which may make life easier/harder
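As a sketch of that first bullet (the function and argument names here are illustrative, not from the eventual patch), the guard pattern could look like:

```cpp
#include "mozilla/Assertions.h"  // MOZ_ASSERT
#include "nsXULAppAPI.h"         // XRE_IsContentProcess

// Sketch only: non-accumulation operations bail out (and assert in debug
// builds) when invoked in a content process, since they could race with
// the parent-side aggregation.
void ClearHistogram(mozilla::Telemetry::ID aId)
{
  MOZ_ASSERT(!XRE_IsContentProcess(),
             "Only the parent process may clear histograms");
  if (XRE_IsContentProcess()) {
    return;  // in release builds, silently drop the racy call
  }
  // ... parent-only clearing logic ...
}
```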
Blocks: 1271336
We can make this specifically about the work to send up histogram accumulations to the parent process and aggregate them there.
I'll file off the other dependencies as blocking bug 1272003
Priority: P4 → P3
Summary: [e10s] Subsession pings should also submit content-process telemetry → Aggregate content process Telemetry in the main process
Whiteboard: [measurement:client] → [measurement:client][fce-active]
Instead of accumulating and storing child process telemetry on the child process,
bridge the accumulations to the parent process and store them there with a
"#child" histogram name suffix.

This can be extended to support other types of processes requiring telemetry by
using their process type instead of "child".

slowSQL, threadHangs, and main thread I/O are not transmitted.

Existing infrastructure for the accumulation and storage of telemetry on child
processes has been left intact for the moment.

This patch does not make changes to TelemetrySession or aboutTelemetry that
are necessary for the transmission and display of this recorded Telemetry.

Review commit: https://reviewboard.mozilla.org/r/57962/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/57962/
Attachment #8760367 - Flags: review?(gfritzsche)
This is clearly not a complete solution, but given how the work is divvied up between bugs, I'm not sure whether this is the right place to end work on this bug and start on another.

I have seen no particular worries with sending each accumulation as it comes in, but I'm putting this through its talos paces before I rule it as actually okay. Otherwise, batching and timers will have to enter the mix.

I could see about blocking init of child process telemetry infrastructure (static arrays and whatnot globally init'd), but that cleanup is a bit architectural in tone, so I'd appreciate feedback on an approach that isn't just peppering Telemetry.cpp with if(XRE_IsContentProcess) return;

Also, at present there's a slight change in semantics. If flag or count histograms don't have a value, the histogram snapshotting code takes care of init-ing them to false/0 before giving the snapshot to JS. This isn't being done for #child variants in this patch. Again with the architecture, I'm not sure that simply adding "#child" to each of the flag histograms being initialized is the best plan, as we'll soon need to consider other process types.
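To make the storage scheme concrete, here is a sketch of what a parent-side accumulation under the suffixed name could look like (illustrative only; `internal_GetHistogramByName()` is a stand-in lookup helper, not the patch's actual API):

```cpp
// Sketch only: the parent records a child sample into a histogram whose
// name carries a process suffix, e.g. "GC_MS" becomes "GC_MS#child".
void AccumulateChild(mozilla::Telemetry::ID aId, uint32_t aSample)
{
  MOZ_ASSERT(XRE_IsParentProcess());
  nsCString name(gHistograms[aId].id());  // the histogram's registered name
  name.AppendLiteral("#child");           // CHILD_HISTOGRAM_SUFFIX
  if (Histogram* h = internal_GetHistogramByName(name)) {
    h->Add(aSample);
  }
}
```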
Comment on attachment 8760367 [details]
bug 1218576 - Child-process Telemetry accumulation in the parent process

https://reviewboard.mozilla.org/r/57962/#review55116

So I've run it through e10s talos a bunch of times: http://mzl.la/1t5N6Dc

Surprisingly there are only two high-confidence regressions (a11yr summary opt e10s, tp5o opt e10s). Still, this patch will not be landable as-is. I guess I'll need to batch things and send them off.
Attachment #8760367 - Flags: review-
:rvitillo developed a wrapper for Intel's "Power Gadget" called energia[1], which should be useful for evaluating whether whatever solution we choose is any worse, power-consumption-wise, than the status quo.

[1]: https://github.com/mozilla/energia
Interestingly, simply putting a dumb "batch until we reach 50 calls, then send them in a single package" on it knocks the a11yr regression down to 4.7%: http://mzl.la/1Xcb9Ne

However, since "timeliness" is also important, I'll see what putting a time bound (instead of the 50-accumulation space bound) on it does.
So, "batching until 1s passes" is better on a11yr (brings the regression down to 4.0%) but is worse on tabpaint (brings it up to a 27.73% regression): https://mzl.la/1YgBfyc

Maybe I should look into running this on idle? We would lose timeliness (is there ever a guarantee of idle time?), and we may end up with higher power consumption (as we would be sending and recording telemetry during an otherwise-idle time)... but it'd save us from incurring costs during times of high use.

... _is_ there a "perform on idle" mechanism?
FWIW, raising the timeout to 2s halves the a11yr regression to 2%, and nixes the regression on tabpaint: https://mzl.la/1Pkwn3W 

This seems like an acceptable level both of regression and patch complexity. Allow me to send it up to mozreview...
Sending each accumulation one by one introduces some nasty talos regressions.
So, let's batch them, and only send on periodic intervals.

Two notes:
1) Only start the timer when MIN_NUM_ACCUMULATIONS accumulations have
accumulated
2) Do not fully disarm the timer until after SendTelemetry completes

These both, independently, assist in ensuring that accumulations that happen
during SendTelemetry do not force us into a quick loop. That could result in
negative consequences from a power-consumption POV.
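A minimal sketch of the scheme those notes describe (the constants mirror the commit message; `SendBatch()` stands in for the real IPC send):

```cpp
#include "nsCOMPtr.h"
#include "nsComponentManagerUtils.h"  // do_CreateInstance
#include "nsITimer.h"
#include "nsTArray.h"

struct Accumulation { uint32_t mId; uint32_t mSample; };

static const size_t MIN_NUM_ACCUMULATIONS = 5;
static const uint32_t BATCH_TIMEOUT_MS = 2000;

static nsTArray<Accumulation> gPending;
static nsCOMPtr<nsITimer> gBatchTimer;

void SendBatch(const nsTArray<Accumulation>& aBatch);  // stand-in for the IPC send

static void BatchTimerFired(nsITimer*, void*)
{
  SendBatch(gPending);    // the send itself may accumulate (e.g. IPC_MESSAGE_SIZE)
  gPending.Clear();
  gBatchTimer = nullptr;  // note 2: disarm only after the send has returned
}

void Accumulate(uint32_t aId, uint32_t aSample)
{
  gPending.AppendElement(Accumulation{aId, aSample});
  // Note 1: don't arm the timer until a few samples have queued up, so the
  // send's own accumulations can't keep waking us up in a tight loop.
  if (!gBatchTimer && gPending.Length() > MIN_NUM_ACCUMULATIONS) {
    gBatchTimer = do_CreateInstance(NS_TIMER_CONTRACTID);
    gBatchTimer->InitWithFuncCallback(BatchTimerFired, nullptr,
                                      BATCH_TIMEOUT_MS,
                                      nsITimer::TYPE_ONE_SHOT);
  }
}
```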

Review commit: https://reviewboard.mozilla.org/r/59018/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/59018/
Attachment #8762163 - Flags: review?(gfritzsche)
(In reply to Chris H-C :chutten from comment #15)
> FWIW, raising the timeout to 2s halves the a11yr regression to 2%, and nixes
> the regression on tabpaint: https://mzl.la/1Pkwn3W 
> 
> This seems like an acceptable level both of regression and patch complexity.
> Allow me to send it up to mozreview...

Is that an acceptable regression?
Can we find out which histograms trigger the main traffic?
Maybe there are easy fixes for the regression like just not collecting them from the content process.

Alternatively we could do comparisons of the timeout for 2s, 10s, 30s, 60s and make a call based on that.
Comment on attachment 8760367 [details]
bug 1218576 - Child-process Telemetry accumulation in the parent process

https://reviewboard.mozilla.org/r/57962/#review55124

::: dom/ipc/PContent.ipdl:552
(Diff revision 1)
>       * Start accessibility engine in content process.
>       */
>      async ActivateA11y();
>  
>      async AppInfo(nsCString version, nsCString buildID, nsCString name, nsCString UAName,
> -                  nsCString ID, nsCString vendor);
> +                  nsCString aID, nsCString vendor);

Unneeded change & it doesn't match the style of the other parameters.

::: dom/ipc/PContent.ipdl:1214
(Diff revision 1)
> +     * Tells the parent to accumulate `sample` on the child histogram for `aId`
> +     * optionally keyed by `key` if it isn't voided.
> +     *
> +     * This is how content processes communicate their Telemetry
> +     */
> +    async Telemetry(ID aId, nsCString aKey, uint32_t aSample);

`Telemetry()` is too generic.
We have scalar measurements coming up and we may need different operations too (clear etc.).
`AccumulateChildHistogram()`?

Also, i think it's much cleaner to use a separate message/function for keyed histograms (instead of logical branching based on optional parameters).
Or are there other arguments against that?

::: toolkit/components/telemetry/Telemetry.h:72
(Diff revision 1)
> + *
> + * @param id - histogram id
> + * @param key - the string key if this is a keyed histogram, a IsVoid string otherwise
> + * @param sample - value to record
> + */
> +void AccumulateChild(ID aId, const nsCString& aKey, uint32_t sample);

`AccumulateChildHistogram()`.
Separate function for keyed histograms?

::: toolkit/components/telemetry/TelemetryComms.h:18
(Diff revision 1)
> +bool RemoteAccumulate(mozilla::Telemetry::ID aId, uint32_t aSample);
> +bool RemoteAccumulate(mozilla::Telemetry::ID aId, const nsCString& aKey, uint32_t aSample);

These functions are only used in `TelemetryHistogram.cpp`; I don't think they need to be available in a shared module.
Let's move them into `TelemetryHistogram.cpp`.

::: toolkit/components/telemetry/TelemetryComms.h:28
(Diff revision 1)
> +} // namespace mozilla
> +
> +namespace IPC {
> +
> +template <>
> +struct ParamTraits<mozilla::Telemetry::ID>

I don't think we need that serializer outside of `TelemetryHistogram.cpp`.

::: toolkit/components/telemetry/TelemetryComms.cpp:31
(Diff revision 1)
> +}
> +
> +bool RemoteAccumulate(mozilla::Telemetry::ID aId, const nsCString& aKey,
> +                      uint32_t aSample)
> +{
> +  if (XRE_IsContentProcess()) {

Also `MOZ_ASSERT(XRE_IsContentProcess())`?

::: toolkit/components/telemetry/TelemetryHistogram.cpp:367
(Diff revision 1)
> +  Histogram *h;
> +  if (child) {
> +    h = knownChildHistograms[id];
> +  } else {
> +    h = knownHistograms[id];
> +  }

Nit: Use a ternary.

::: toolkit/components/telemetry/TelemetryHistogram.cpp:1689
(Diff revision 1)
>  
> +void AccumulateChild(mozilla::Telemetry::ID aID, const nsCString& aKey,
> +                                                 uint32_t aSample)
> +{
> +  MOZ_ASSERT(XRE_IsParentProcess());
> +  MOZ_ASSERT(CanRecordBase());

I don't think this assert can hold.
Consider:
* trigger accumulate in child & queue message
* disable base recording in parent
* receive message
* trigger assert
Attachment #8760367 - Flags: review?(gfritzsche)
https://reviewboard.mozilla.org/r/57962/#review56870

::: toolkit/components/telemetry/TelemetryComms.h:16
(Diff revision 1)
> +namespace mozilla {
> +namespace Telemetry {
> +
> +enum ID : uint32_t;
> +
> +namespace TelemetryComms {

Do we need to repeat `Telemetry` here?
This becomes `mozilla::Telemetry::TelemetryComms`. Why not `mozilla::Telemetry::Comms` or so?

::: toolkit/components/telemetry/TelemetryComms.cpp:31
(Diff revision 1)
> +}
> +
> +bool RemoteAccumulate(mozilla::Telemetry::ID aId, const nsCString& aKey,
> +                      uint32_t aSample)
> +{
> +  if (XRE_IsContentProcess()) {

Easier to read with early return?
`if (!XRE_IsContentProcess()) ...`

::: toolkit/components/telemetry/TelemetryHistogram.cpp:1690
(Diff revision 1)
> +void AccumulateChild(mozilla::Telemetry::ID aID, const nsCString& aKey,
> +                                                 uint32_t aSample)
> +{
> +  MOZ_ASSERT(XRE_IsParentProcess());
> +  MOZ_ASSERT(CanRecordBase());
> +

This will need locking, matching the updated `TelemetryHistogram` interface locking style.
Attachment #8762163 - Flags: review?(gfritzsche)
Comment on attachment 8762163 [details]
bug 1218576 - batch child Telemetry accumulations to limit IPC calls

https://reviewboard.mozilla.org/r/59018/#review56866

::: dom/ipc/ContentParent.cpp:5901
(Diff revision 1)
> +    Telemetry::AccumulateChild(aAccumulations[i].mId,
> +                               aAccumulations[i].mKey,
> +                               aAccumulations[i].mSample);

So this will lock the histogram module for each individual accumulation.

For perf reasons, we should maybe pass the whole batch to `TelemetryHistogram`, let it lock once and accumulate the whole batch under one lock.

::: toolkit/components/telemetry/TelemetryComms.h:18
(Diff revision 1)
>  
>  enum ID : uint32_t;
>  
>  namespace TelemetryComms {
>  
> +struct Accumulation

Can the accumulation helpers here live in `TelemetryHistogram.cpp` or so?

::: toolkit/components/telemetry/TelemetryComms.cpp:20
(Diff revision 1)
> +#define MIN_NUM_ACCUMULATIONS 5
> +#define BATCH_TIMEOUT_MS 2000

Let's use `const` integral types here.

::: toolkit/components/telemetry/TelemetryComms.cpp:52
(Diff revision 1)
> +  // This should prevent us from waking up the thread when otherwise idle
> +  // due to accumulations caused by SendTelemetry
> +  NS_RELEASE(sCommTimer);
> +}
> +
>  bool RemoteAccumulate(mozilla::Telemetry::ID aId, const nsCString& aKey,

How do we avoid races for the `RemoteAccumulate()` calls?
Do we assume that the `TelemetryHistogram` caller takes care of serializing the calls?
If so, we should document that.

If we can move this into `TelemetryHistogram.cpp` it will be much clearer who the callers are and would be easier to reason about thread safety etc.

::: toolkit/components/telemetry/TelemetryComms.cpp:57
(Diff revision 1)
>  bool RemoteAccumulate(mozilla::Telemetry::ID aId, const nsCString& aKey,
>                        uint32_t aSample)
>  {
>    if (XRE_IsContentProcess()) {
> -    mozilla::dom::ContentChild* parentActor = mozilla::dom::ContentChild::GetSingleton();
> -    if (!NS_WARN_IF(!parentActor)) {
> +    accumulations.AppendElement(Accumulation{aId, aKey, aSample});
> +    if (!sCommTimer && accumulations.Length() > MIN_NUM_ACCUMULATIONS) {

I don't think waiting for a minimum number of accumulations is feasible.
In the worst-case we could have one accumulation waiting for minutes for other accumulations to happen.
I think we should never delay longer than `BATCH_TIMEOUT_MS`.

This will be called from different threads.
Is that fine for the timer, does it matter what thread it lives on etc.?
https://reviewboard.mozilla.org/r/57962/#review55124

> Unneeded change & it doesn't match the style of the other parameters.

Sadly it was needed. The identifier `ID` was shadowing the type `ID` brought in by the `using` statement. If you'd prefer not `aID`, then I could maybe just lowercase the 'D' and use `Id` instead.

> `Telemetry()` is too generic.
> We have scalar measurements coming up and we may need different operations too (clear etc.).
> `AccumulateChildHistogram()`?
> 
> Also, i think it's much cleaner to use a separate message/function for keyed histograms (instead of logical branching based on optional parameters).
> Or are there other arguments against that?

I was trying to keep the clutter down in PContent, since I was just using it for its IPC. Given the amount of other stuff in the ipdl, I guess a couple other methods wouldn't be noticed :S

> I don't think we need that serializer outside of `TelemetryHistogram.cpp`.

It needs to be reachable by PContent for it to know how to serialize the message.

> Also `MOZ_ASSERT(XRE_IsContentProcess())`?

This is called on the parent process as well. It just returns false in that case (see callers)

> I don't think this assert can hold.
> Consider:
> * trigger accumulate in child & queue message
> * disable base recording in parent
> * receive message
> * trigger assert

True. I'll early return on it.
https://reviewboard.mozilla.org/r/59018/#review56866

> Can the accumulation helpers here live in `TelemetryHistogram.cpp` or so?

Someone was asking me to keep all the IPC-related stuff out of Telemetry\*.cpp and to put as much as possible in its own file. Having this stuff live in TelemetryHistogram would simplify matters greatly, if this isn't a concern.

> I don't think waiting for a minimum number of accumulations is feasible.
> In the worst-case we could have one accumulation waiting for minutes for other accumulations to happen.
> I think we should never delay longer than `BATCH_TIMEOUT_MS`.
> 
> This will be called from different threads.
> Is that fine for the timer, does it matter what thread it lives on etc.?

The problem comes when `SendTelemetry` causes an accumulation (which it does). Then we'll be stuck in a loop of always sending one histogram every `BATCH_TIMEOUT_MS` even when idle.

With only clearing the timer after `SendTelemetry` returns we only account for synchronous accumulations. I do not wish to increase the power load of Firefox.
(In reply to Chris H-C :chutten from comment #22)
> https://reviewboard.mozilla.org/r/59018/#review56866
> 
> > Can the accumulation helpers here live in `TelemetryHistogram.cpp` or so?
> 
> Someone was asking me to keep all the IPC-related stuff out of
> Telemetry\*.cpp and to put as much as possible in its own file. Having this
> stuff live in TelemetryHistogram would simplify matters greatly, if this
> isn't a concern

I don't know of the concerns behind that, so I can't comment on them.
In general I think that if this does not need to be accessible outside of that module, let's keep it contained.
If logical separation is a concern, we could put it in a different namespace in `TelemetryHistogram.cpp`.

> > I don't think waiting for a minimum number of accumulations is feasible.
> > In the worst-case we could have one accumulation waiting for minutes for other accumulations to happen.
> > I think we should never delay longer than `BATCH_TIMEOUT_MS`.
> > 
> > This will be called from different threads.
> > Is that fine for the timer, does it matter what thread it lives on etc.?
> 
> The problem comes when `SendTelemetry` causes an accumulation (which it
> does). Then we'll be stuck in a loop of always sending one histogram every
> `BATCH_TIMEOUT_MS` even when idle.
> 
> With only clearing the timer after `SendTelemetry` returns we only account
> for synchronous accumulations. I do not wish to increase the power load of
> Firefox.

So that means we always have a fixed number of accumulations in there?
Which are the probes triggered here?
They will potentially mostly be out of date with this, we should consider the impact & their necessity.

Also, using a fixed number of accumulations as the barrier is bound to become outdated when other probes are added.
Is a list of known ids potentially better? It doesn't solve the problem, but might be an improvement.

For simpler and less fragile approaches, is sending every `BATCH_TIMEOUT_MS` actually so bad?
We can use a longer timeout on idle to improve the power & wake situation.
(In reply to Georg Fritzsche [:gfritzsche] from comment #23)
> > > I don't think waiting for a minimum number of accumulations is feasible.
> > > In the worst-case we could have one accumulation waiting for minutes for other accumulations to happen.
> > > I think we should never delay longer than `BATCH_TIMEOUT_MS`.
> > > 
> > > This will be called from different threads.
> > > Is that fine for the timer, does it matter what thread it lives on etc.?
> > 
> > The problem comes when `SendTelemetry` causes an accumulation (which it
> > does). Then we'll be stuck in a loop of always sending one histogram every
> > `BATCH_TIMEOUT_MS` even when idle.
> > 
> > With only clearing the timer after `SendTelemetry` returns we only account
> > for synchronous accumulations. I do not wish to increase the power load of
> > Firefox.
> 
> So that means we always have a fixed number of accumulations in there?
> Which are the probes triggered here?
> They will potentially mostly be out of date with this, we should consider
> the impact & their necessity.
> 
> Also, using a fixed number of accumulations as the barrier is bound to
> become outdated when other probes are added.
> Is a list of known ids potentially better? It doesn't solve the problem, but
> might be an improvement.
> 
> For simpler and less fragile approaches, is sending every `BATCH_TIMEOUT_MS`
> actually so bad?
> We can use a longer timeout on idle to improve the power & wake situation.

The one that comes up synchronously during SendTelemetry is IPC_MESSAGE_SIZE. There's no limit to how many may be triggered asynchronously, and I think no way to know.

I can remove the 5-accumulation minimum, leaving in the rest to specifically avoid synchronously-triggered accumulations. That'll likely be good enough to avoid the more obvious wake loops.
(In reply to Chris H-C :chutten from comment #24)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #23)
> > > > I don't think waiting for a minimum number of accumulations is feasible.
> > > > In the worst-case we could have one accumulation waiting for minutes for other accumulations to happen.
> > > > I think we should never delay longer than `BATCH_TIMEOUT_MS`.
> > > > 
> > > > This will be called from different threads.
> > > > Is that fine for the timer, does it matter what thread it lives on etc.?
> > > 
> > > The problem comes when `SendTelemetry` causes an accumulation (which it
> > > does). Then we'll be stuck in a loop of always sending one histogram every
> > > `BATCH_TIMEOUT_MS` even when idle.
> > > 
> > > With only clearing the timer after `SendTelemetry` returns we only account
> > > for synchronous accumulations. I do not wish to increase the power load of
> > > Firefox.
> > 
> > So that means we always have a fixed number of accumulations in there?
> > Which are the probes triggered here?
> > They will potentially mostly be out of date with this, we should consider
> > the impact & their necessity.
> > 
> > Also, using a fixed number of accumulations as the barrier is bound to
> > become outdated when other probes are added.
> > Is a list of known ids potentially better? It doesn't solve the problem, but
> > might be an improvement.
> > 
> > For simpler and less fragile approaches, is sending every `BATCH_TIMEOUT_MS`
> > actually so bad?
> > We can use a longer timeout on idle to improve the power & wake situation.
> 
> The one that comes up synchronously during SendTelemetry is
> IPC_MESSAGE_SIZE. There's no limit to how many may be triggered
> asynchronously, and I think no way to know.

Do we even need this one on the child side?
It seems like recording message sizes in the parent would be sufficient.
 
> I can remove the 5-accumulation minimum, leaving in the rest to specifically
> avoid synchronously-triggered accumulations. That'll likely be good enough
> to avoid the more obvious wake loops.

That sounds good.
Maybe this is just a good argument for - in a later bug - considering not recording probes in the child by default and triaging which ones we actually need.
One concern that came up:
Let's confirm that we are duplicating the histogram accumulations into "session" & "subsession" histograms, otherwise the moz_aggregator (and potentially other dependencies?) can't update to this change easily.

Longer-term we want those projects to use "main" pings, but that hasn't happened yet.
Yup. internal_HistogramAdd will multiplex a single accumulation to both "sub#" and non-"sub#" hgrams, so no changes there.
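In sketch form (the lookup helper's name is approximate, not a quote of the actual code):

```cpp
// Rough shape of the existing behavior: one incoming sample is multiplexed
// into the whole-session histogram and its "sub#"-prefixed subsession twin,
// so both consumers keep working.
void internal_HistogramAdd(Histogram& histogram, uint32_t sample)
{
  histogram.Add(sample);
  if (Histogram* subsession = internal_GetSubsessionHistogram(histogram)) {
    subsession->Add(sample);
  }
}
```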
Here's the reworked (and now rebased) patch.

I've run it on its 2s [1] and tried out a 5s [2] batch timer. Looking acceptable at 2s, so that's where I've left it.

Now to run it through energia to see how it does there. I don't expect it to show too much in the way of problems, so this is back up for r? (to :froydnj as :gfritzsche's on PTO)

[1] https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=b170bf1a959a&newProject=try&newRevision=7c3384685ebd65af9b5e8c83c3cd4631499886cd&framework=1&showOnlyImportant=0
[2] https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=b170bf1a959a&newProject=try&newRevision=a94b094a98dce48aa363ff6cc7c68da5dc22fd27&framework=1&showOnlyImportant=0
Attachment #8760367 - Attachment is obsolete: true
Attachment #8762163 - Attachment is obsolete: true
Attachment #8764976 - Flags: review?(nfroyd)
Comment on attachment 8764976 [details] [diff] [review]
0001-bug-1218576-Accumulate-child-histograms-in-the-paren.patch

Review of attachment 8764976 [details] [diff] [review]:
-----------------------------------------------------------------

Lots of comments, but I think this looks reasonable overall.  I'd be curious to know whether avoiding the copying when the timer fires helps with the performance regressions.

f+ only because I'd like to know about the leak situation for debug builds.

::: dom/ipc/PContent.ipdl
@@ +102,5 @@
>  using struct mozilla::dom::FlyWebPublishOptions from "mozilla/dom/FlyWebPublishOptionsIPCSerializer.h";
> +using mozilla::Telemetry::Comms::Accumulation from "mozilla/TelemetryComms.h";
> +using mozilla::Telemetry::Comms::KeyedAccumulation from "mozilla/TelemetryComms.h";
> +using nsTArray<Accumulation> from "mozilla/TelemetryComms.h";
> +using nsTArray<KeyedAccumulation> from "mozilla/TelemetryComms.h";

Wow, does this syntax with nsTArray actually work?  That's frightening.  But I think...

@@ +1215,5 @@
> +     * Messages for communicating child Telemetry to the parent process
> +     */
> +    async AccumulateChildHistogram(nsTArray<Accumulation> aAccumulations);
> +    async AccumulateChildKeyedHistogram(
> +                    nsTArray<KeyedAccumulation> aAccumulations);

...you ought to be able to do what you want with:

async AccumulateChildHistograms(Accumulation[] accumulations);

and similar.  You'll have to change the function signatures for this, but I don't think that should cause problems.

::: toolkit/components/telemetry/TelemetryComms.h
@@ +12,5 @@
> +namespace Telemetry {
> +
> +enum ID : uint32_t;
> +
> +namespace Comms {

We generally don't nest namespaces this deeply unless we have to; can you move the things in this namespace up to the Telemetry namespace?  If nothing else, abbreviating like this is somewhat unusual.

@@ +49,5 @@
> +  static bool Read(const Message* aMsg, PickleIterator* aIter, paramType* aResult)
> +  {
> +    if (!aMsg->ReadUInt32(aIter, reinterpret_cast<uint32_t*>(&(aResult->mId))) ||
> +        !ReadParam(aMsg, aIter, &(aResult->mSample)))
> +      return false;

Please brace this |if| statement.

@@ +65,5 @@
> +  static void Write(Message* aMsg, const paramType& aParam)
> +  {
> +    aMsg->WriteUInt32(aParam.mId);
> +    WriteParam(aMsg, aParam.mSample);
> +		WriteParam(aMsg, aParam.mKey);

The indentation here is wonky, please fix.

@@ +73,5 @@
> +  {
> +    if (!aMsg->ReadUInt32(aIter, reinterpret_cast<uint32_t*>(&(aResult->mId))) ||
> +        !ReadParam(aMsg, aIter, &(aResult->mSample)) ||
> +        !ReadParam(aMsg, aIter, &(aResult->mKey)))
> +      return false;

Please brace this |if| statement.
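Folding those nits together, the serializer under review could read roughly like this (a sketch against the quoted fields, with the struct hoisted into `mozilla::Telemetry` per the namespace comment above; not the final patch):

```cpp
namespace mozilla {
namespace Telemetry {

struct KeyedAccumulation
{
  mozilla::Telemetry::ID mId;
  uint32_t mSample;
  nsCString mKey;
};

} // namespace Telemetry
} // namespace mozilla

namespace IPC {

template <>
struct ParamTraits<mozilla::Telemetry::KeyedAccumulation>
{
  typedef mozilla::Telemetry::KeyedAccumulation paramType;

  static void Write(Message* aMsg, const paramType& aParam)
  {
    aMsg->WriteUInt32(aParam.mId);
    WriteParam(aMsg, aParam.mSample);
    WriteParam(aMsg, aParam.mKey);  // indentation fixed
  }

  static bool Read(const Message* aMsg, PickleIterator* aIter,
                   paramType* aResult)
  {
    if (!aMsg->ReadUInt32(aIter, reinterpret_cast<uint32_t*>(&aResult->mId)) ||
        !ReadParam(aMsg, aIter, &aResult->mSample) ||
        !ReadParam(aMsg, aIter, &aResult->mKey)) {  // braced, per review
      return false;
    }
    return true;
  }
};

} // namespace IPC
```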

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +182,5 @@
>  // The singleton StatisticsRecorder object for this process.
>  base::StatisticsRecorder* gStatisticsRecorder = nullptr;
>  
> +// For batching and sending child process accumulations to the parent
> +nsITimer* gCommTimer = nullptr;

I have a slight preference for |gIPCTimer|, but up to you.

@@ +184,5 @@
>  
> +// For batching and sending child process accumulations to the parent
> +nsITimer* gCommTimer = nullptr;
> +nsTArray<Accumulation> gAccumulations;
> +nsTArray<KeyedAccumulation> gKeyedAccumulations;

Have you confirmed that these arrays and the associated timer don't leak in debug builds?  They seem a bit tailor-made to intermittently leak.

These arrays are also going to cause static constructors, which we prefer to avoid.  I would like to see a patch that removes the static constructors (e.g. by allocating arrays when they are first touched--which would also force you to think about when they are going to be deallocated).

@@ +206,4 @@
>    mozilla::Telemetry::TELEMETRY_TEST_KEYED_COUNT_INIT_NO_RECORD
>  };
>  
> +const uint32_t kBatchTimeoutMs = 2000;

Some comment as to why this is what it is might be advisable?

@@ +1676,5 @@
> +  if (XRE_IsParentProcess()) {
> +    return false;
> +  }
> +  gAccumulations.AppendElement(Accumulation{aId, aSample});
> +  if (!gCommTimer) {

Can we separate out the code for the timer arming into a separate function?

@@ +1695,5 @@
> +  if (XRE_IsParentProcess()) {
> +    return false;
> +  }
> +  gKeyedAccumulations.AppendElement(KeyedAccumulation{aId, aSample, aKey});
> +  if (!gCommTimer) {

...and then use it here, of course.

@@ +2477,5 @@
> +TelemetryHistogram::CommTimerFired(nsITimer* aTimer, void* aClosure)
> +{
> +  gTelemetryHistogramMutex.Lock();
> +  nsTArray<Accumulation> accumulationsToSend(gAccumulations);
> +  gAccumulations.Clear();

Please do this as:

nsTArray<Accumulation> accumulationsToSend;
accumulationsToSend.SwapElements(gAccumulations);

or:

nsTArray<Accumulation> accumulationsToSend(Move(gAccumulations));

so we're not needlessly copying things.

@@ +2479,5 @@
> +  gTelemetryHistogramMutex.Lock();
> +  nsTArray<Accumulation> accumulationsToSend(gAccumulations);
> +  gAccumulations.Clear();
> +  nsTArray<KeyedAccumulation> keyedAccumulationsToSend(gKeyedAccumulations);
> +  gKeyedAccumulations.Clear();

Same here.

@@ +2480,5 @@
> +  nsTArray<Accumulation> accumulationsToSend(gAccumulations);
> +  gAccumulations.Clear();
> +  nsTArray<KeyedAccumulation> keyedAccumulationsToSend(gKeyedAccumulations);
> +  gKeyedAccumulations.Clear();
> +  gTelemetryHistogramMutex.Unlock();

Please do this with a separate scope:

nsTArray<Accumulation> ...;
nsTArray<KeyedAccumulation> ...;

{
  MutexAutoLock(gTelemetryHistogramMutex);
  // initialize local arrays with SwapElements or move assignment
}

Probably a bit more code, but at least we don't have separate Lock and Unlock calls floating around.
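Concretely, the suggested shape might be (a sketch, assuming the arrays are still plain globals at this point in the patch series):

```cpp
void TelemetryHistogram::CommTimerFired(nsITimer* aTimer, void* aClosure)
{
  nsTArray<Accumulation> accumulationsToSend;
  nsTArray<KeyedAccumulation> keyedAccumulationsToSend;
  {
    // Scope the lock so there are no bare Lock()/Unlock() calls, and move
    // the pending contents out instead of copying them.
    MutexAutoLock locker(gTelemetryHistogramMutex);
    accumulationsToSend.SwapElements(gAccumulations);
    keyedAccumulationsToSend.SwapElements(gKeyedAccumulations);
  }
  // ... send both arrays over IPC without holding the lock ...
}
```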
Attachment #8764976 - Flags: review?(nfroyd) → feedback+
Thank you for the review!

My apologies for the blunders. I could've sworn I'd Clear()'d the accumulations arrays in DeInitialize... ah well. Static init bad, lazy init good.

Also: I didn't know about the array notation trick in IPDL. Maybe that should be included in the docs instead of instructions on how to use containers? [0]

Please find attached a patch with your comments addressed. I've thrown it up to try for a broad spectrum e10s talos comparison (base[1] 2s[2] 1s[3]) and I'll post back when I generate enough data for reasonable comparison.

Honestly, though, 2s is timely enough.

[0] https://developer.mozilla.org/en-US/docs/Mozilla/IPDL/Type_Serialization
[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=b408a62ee6e9
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=567d31460810
[3] https://treeherder.mozilla.org/#/jobs?repo=try&revision=8011a3dfbc14
Attachment #8764976 - Attachment is obsolete: true
Attachment #8765964 - Flags: review?(nfroyd)
Comment on attachment 8765964 [details] [diff] [review]
0001-bug-1218576-Accumulate-child-histograms-in-the-paren.patch

Review of attachment 8765964 [details] [diff] [review]:
-----------------------------------------------------------------

r=me, I guess. I'd like to see whether we could decrease the timer frequency.

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +214,5 @@
> +// Sending each remote accumulation immediately places undue strain on the
> +// IPC subsystem. Batch the remote accumulations for a period of time before
> +// sending them all at once. This value was chosen as a balance between data
> +// timeliness and performance (see bug 1218576)
> +const uint32_t kBatchTimeoutMs = 2000;

Having something waking up every two seconds doesn't fill me with warm fuzzies, even though I know we already wake up quite a bit.  Have you tried seeing what sort of sizes the accumulation buffers get to with 2s?  With 5s?

@@ +1642,5 @@
> +{
> +  if (gIPCTimer) {
> +    return;
> +  }
> +  CallCreateInstance(NS_TIMER_CONTRACTID, &gIPCTimer);

It might be more efficient to only create the timer once and InitWithFuncCallback each time.  You'd have to have a separate flag for "is the timer currently running", but you'd avoid a bit of allocation churn that way.

@@ +2468,5 @@
> +      keyedAccumulationsToSend.SwapElements(*gKeyedAccumulations);
> +    }
> +  }
> +
> +  mozilla::dom::ContentChild* parentActor = mozilla::dom::ContentChild::GetSingleton();

Is |parentActor| really the right name to use here?  |parentActor| for a Child thing doesn't read very well.
Attachment #8765964 - Flags: review?(nfroyd) → review+
Thanks again!

FWIW, it doesn't fire _every_ 2s. It only fires within 2s of a telemetry accumulation (so long as that accumulation doesn't happen during IPCTimerFired, like IPC_MESSAGE_SIZE does). So, for instance, if the browser is in the background, the timer doesn't fire. (Well, it _does_ fire if the browser's awake and accumulating... but only within 2s of an accumulation)

I have verified this. With firefox.com loaded and the browser minimized, the timer doesn't fire except for the very occasional REFRESH_DRIVER_TICK (Not sure why we're still getting those, but that's an examination for another time perhaps).

InitWithFuncCallback... can you rearm like that with TYPE_ONE_SHOT? *checks* Oh, you sure can. I must admit I was just following the pattern I'd seen elsewhere in the codebase (like docshell's nsPingListener::StartTimeout() ). Having a bool gIPCTimerArmed or somesuch would work (I'm actually a little confused as to why nsITimer doesn't expose that. Maybe it'd encourage static timer init.). For an at-most-every-2s allocation, is it worth it... not sure.
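A sketch of that rearm-in-place idea, with the hypothetical `gIPCTimerArmed` flag mentioned above (names illustrative):

```cpp
static nsCOMPtr<nsITimer> gIPCTimer;
static bool gIPCTimerArmed = false;  // nsITimer doesn't expose armed state

static void ArmIPCTimer()
{
  if (gIPCTimerArmed) {
    return;
  }
  if (!gIPCTimer) {
    // Allocate once; later arms reuse the same timer instance.
    gIPCTimer = do_CreateInstance(NS_TIMER_CONTRACTID);
  }
  if (gIPCTimer) {
    gIPCTimer->InitWithFuncCallback(IPCTimerFired, nullptr, kBatchTimeoutMs,
                                    nsITimer::TYPE_ONE_SHOT);
    gIPCTimerArmed = true;
  }
}
```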

parentActor is what other code called it, so I stuck with it. Simply calling it contentChild would probably do.

In other news, I have found out while trying to get this code to populate childPayloads that it doesn't handle subsession data collection properly. More patches incoming to handle that, and the necessary TelemetrySession glue.
Attachment #8765964 - Attachment is obsolete: true
Attachment #8768060 - Flags: review+
Attachment #8768061 - Flags: review?(gfritzsche)
Attachment #8768063 - Flags: review?(gfritzsche)
Attachment #8768064 - Flags: review?(gfritzsche)
Attachment #8768065 - Flags: review?(gfritzsche)
Attachment #8768066 - Flags: review?(gfritzsche)
Attachment #8768068 - Flags: review?(gfritzsche)
Attachment #8768069 - Flags: review?(gfritzsche)
Sorry for the large # of patches, I figure this way it'll be easier to review and think about.

This puts the codebase in a place where the new parent-side child telemetry aggregation is used to collect and send child telemetry in the same format as it is sent today, plus one additional (stripped-down) childPayload containing subsession information.

Still not done: about:telemetry (see bug 1277504), cleaning up (bug 1277503), and changing how/when we send the payloads (see bug 1281795)
Attachment #8768072 - Flags: review?(gfritzsche)
I have two high-level concerns here that we can address in separate patches.
Let me know if either of these are already addressed in some patch.

I think we need opt-in Telemetry from the parent process to track how often those accumulations are sent, otherwise we are blind to the real-world impact and regressions.
This could be either:
* count of batches received (and we break it down vs. sessionLength/subsessionLength) or
* time between batches received

Given the batching mechanism, there is also a concern about attributing accumulations to the wrong subsession:
* accumulate values in child for up to 2sec (or whatever the batching delay is)
* clear subsession histograms in parent
* batching delay ends, old subsession values are sent up to the parent

I can think of two ways to deal with this (both with still slight racing risk):
* send a "clear subsession accumulations" message to the child
* track in parent & child which subsession no. we are in
Comment on attachment 8768060 [details] [diff] [review]
0001-bug-1218576-Accumulate-child-histograms-in-the-paren.patch

Review of attachment 8768060 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +1858,5 @@
>    gKeyedHistograms.Clear();
>    gAddonMap.Clear();
> +  if (gAccumulations) {
> +    delete gAccumulations;
> +    gAccumulations = nullptr;

You could just make this a smart pointer.
Then the three lines shrink to just `gAccumulations = nullptr;` or so.

@@ +1862,5 @@
> +    gAccumulations = nullptr;
> +  }
> +  if (gKeyedAccumulations) {
> +    delete gKeyedAccumulations;
> +    gKeyedAccumulations = nullptr;

Ditto.

@@ +1863,5 @@
> +  }
> +  if (gKeyedAccumulations) {
> +    delete gKeyedAccumulations;
> +    gKeyedAccumulations = nullptr;
> +  }

What about clearing gIPCTimer?
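A sketch of the smart-pointer suggestion, assuming `mozilla::StaticAutoPtr` (whose assignment deletes the previously owned object); `DeInitialize()` is an approximation of the real teardown function:

```cpp
#include "mozilla/StaticPtr.h"

static mozilla::StaticAutoPtr<nsTArray<Accumulation>> gAccumulations;
static mozilla::StaticAutoPtr<nsTArray<KeyedAccumulation>> gKeyedAccumulations;

void DeInitialize()
{
  // Assigning nullptr deletes the arrays (if any) and clears the pointers,
  // collapsing each check/delete/null triple into one line.
  gAccumulations = nullptr;
  gKeyedAccumulations = nullptr;
  // ... and, per the review question, cancel and release gIPCTimer here ...
}
```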
(In reply to Georg Fritzsche [:gfritzsche] from comment #44)
> Given the batching mechanism, there is also a concern about attributing
> accumulations to the wrong subsession:
> * accumulate values in child for up to 2sec (or whatever the batching delay
> is)
> * clear subsession histograms in parent
> * batching delay ends, old subsession values are sent up to the parent
> 
> I can think of two ways to deal with this (both with still slight racing
> risk):
> * send a "clear subsession accumulations" message to the child
> * track in parent & child which subsession no. we are in

Per IRC, this is best addressed in a follow-up bug.
The content process actually only sends each accumulation once and doesn't differentiate between subsession and "plain" histograms.

A better approach could be to:
* in the child: timestamp the accumulations
* in the parent:
  * store the timestamp for the last subsession reset
  * drop all accumulations with a timestamp before the last subsession reset
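A rough sketch of that timestamp scheme (all names hypothetical; this is follow-up material, not part of the patches here):

```cpp
struct TimestampedAccumulation
{
  mozilla::TimeStamp mStamp;  // taken in the child when the sample is recorded
  Accumulation mAcc;
};

static mozilla::TimeStamp gLastSubsessionReset;  // parent-side bookkeeping

void AccumulateFromChild(const TimestampedAccumulation& aChild)
{
  if (aChild.mStamp < gLastSubsessionReset) {
    return;  // stale sample from before the subsession split: drop it
  }
  // ... accumulate into the current session & subsession histograms ...
}
```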
Blocks: 1284860
Comment on attachment 8768061 [details] [diff] [review]
0002-bug-1218576-Support-subsession-hgrams-for-child-aggr.patch

Review of attachment 8768061 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +397,5 @@
>    }
>  
> +  CharPtrEntryType *entry = nullptr;
> +  nsDependentCString histogramName(name);
> +  if (StringEndsWith(histogramName, NS_LITERAL_CSTRING(CHILD_HISTOGRAM_SUFFIX))) {

Use NS_NAMED_LITERAL_CSTRING(), then you can also use `.Length()` below and drop the `+1` juggle.

@@ +402,5 @@
> +    auto root = Substring(histogramName, 0,
> +                          histogramName.Length() - sizeof(CHILD_HISTOGRAM_SUFFIX) + 1);
> +    entry = gHistogramMap.GetEntry(PromiseFlatCString(root).get());
> +  } else {
> +    entry = gHistogramMap.GetEntry(name);

How about using a helper `GetHistogramRootName()` or so?
Then all this becomes something like:
> CharPtrEntryType *entry = gHistogramMap.GetEntry(GetHistogramRootName(...)...)

@@ +540,4 @@
>      return nullptr;
>    }
>  
> +  std::string name = existing.histogram_name();

`const std::string&`, otherwise you copy.

@@ +543,5 @@
> +  std::string name = existing.histogram_name();
> +  bool isChild = name.size() > sizeof CHILD_HISTOGRAM_SUFFIX &&
> +                 name.compare(name.size() - sizeof CHILD_HISTOGRAM_SUFFIX,
> +                              sizeof CHILD_HISTOGRAM_SUFFIX,
> +                              CHILD_HISTOGRAM_SUFFIX) == 0;

Can we move this into a helper overload of `StringEndsWith()` or so to reason about separately?
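The suggested helper might look like this (a sketch; note that `sizeof` on a string literal counts the terminating NUL, an off-by-one that a `size()`-based helper sidesteps):

```cpp
#include <string>

bool
StringEndsWith(const std::string& name, const std::string& suffix)
{
  return name.size() >= suffix.size() &&
         name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// At the call site this collapses the quoted expression to something like:
//   bool isChild = StringEndsWith(existing.histogram_name(),
//                                 CHILD_HISTOGRAM_SUFFIX);
```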
Attachment #8768061 - Flags: review?(gfritzsche)
Comment on attachment 8768063 [details] [diff] [review]
0003-bug-1218576-Pre-allocate-child-keyed-histograms-in-t.patch

Review of attachment 8768063 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +1850,5 @@
> +      nsCString childId(id);
> +      childId.AppendLiteral(CHILD_HISTOGRAM_SUFFIX);
> +      gKeyedHistograms.Put(childId,
> +                           new KeyedHistogram(id, expiration, h.histogramType,
> +                                              h.min, h.max, h.bucketCount, h.dataset));

Hm, what code depends on this?
Maybe we can fix that or at least follow up on it in a future bug.
(In reply to Georg Fritzsche [:gfritzsche] from comment #48)
> Comment on attachment 8768063 [details] [diff] [review]
> 0003-bug-1218576-Pre-allocate-child-keyed-histograms-in-t.patch
> 
> Review of attachment 8768063 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: toolkit/components/telemetry/TelemetryHistogram.cpp
> @@ +1850,5 @@
> > +      nsCString childId(id);
> > +      childId.AppendLiteral(CHILD_HISTOGRAM_SUFFIX);
> > +      gKeyedHistograms.Put(childId,
> > +                           new KeyedHistogram(id, expiration, h.histogramType,
> > +                                              h.min, h.max, h.bucketCount, h.dataset));
> 
> Hm, what code depends on this?
> Maybe we can fix that or at least follow up on it in a future bug.

GetKeyedHistogramById returns NS_ERROR_FAILURE if the specified keyed hgram isn't in gKeyedHistograms. This ends up exploding things a bit in TelemetrySession.jsm's getKeyedHistograms. (By contrast, getHistograms uses the histogramSnapshots API to get a copy of all normal hgrams into JS, then iterates. I don't think this approach would work as well for keyed, as each keyed hgram is snapshotted individually.)

With a KeyedHistogram being two nsCStrings and a handful of integral types, I'm not too worried about size/perf implications, which is one of the reasons I opted for the minimally-invasive approach.
Comment on attachment 8768065 [details] [diff] [review]
0005-bug-1218576-Use-child-aggregation-to-populate-childP.patch

Review of attachment 8768065 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +41,4 @@
>  const REASON_ENVIRONMENT_CHANGE = "environment-change";
>  const REASON_SHUTDOWN = "shutdown";
>  
> +const CHILD_PROCESS_TYPES = {

`CHILD_PROCESS_HISTOGRAM_SUFFIXES`?

@@ +892,5 @@
>  
>      for (let name of registered) {
> +      if (childProcessType) {
> +        name += childProcessType;
> +      }

Let's move that out of the loop?
> if (childProcessType) {
>   registered = registered.map(...);

Bonus: We could also move out the "STARTUP_" part and reduce nesting:
> registered = registered.concat(registered.map(...))

@@ +897,5 @@
>        for (let n of [name, "STARTUP_" + name]) {
>          if (n in hls) {
>            // Omit telemetry test histograms outside of tests.
>            if (n.startsWith('TELEMETRY_TEST_') && this._testing == false) {
>              this._log.trace("getHistograms - Skipping test histogram: " + n);

... and this could also be moved out using `registered.filter()`.
Ok, that's getting a bit out of scope... we can also move that to future cleanup bugs :)

@@ +946,5 @@
>          continue;
>        }
> +      let keyed = Telemetry.getKeyedHistogramById(childProcessType ?
> +                                                  id + childProcessType :
> +                                                  id);

Instead use `registered.map()` above?

@@ +1263,5 @@
>      let payloadObj = {
>        ver: PAYLOAD_VERSION,
>        simpleMeasurements: simpleMeasurements,
> +      histograms: Utils.isContentProcess ?
> +                    undefined :

Let's not add them here then:
> let payloadObj = {
>   ...
> };
> 
> if (!Utils.isContentProcess) {
>   payloadObj.histograms = ...
>   payloadObj.keyedHistograms = ...

@@ +1319,3 @@
>      }
> +    // Add the latest child subsession info. It only has hgrams
> +    payloadObj.childPayloads.push({

To keep the current behavior working, isn't it enough to just always add one childPayload with the current non-subsession histograms/keyedHistograms?
Attachment #8768065 - Flags: review?(gfritzsche)
Attachment #8768064 - Flags: review?(gfritzsche) → review+
Comment on attachment 8768063 [details] [diff] [review]
0003-bug-1218576-Pre-allocate-child-keyed-histograms-in-t.patch

Review of attachment 8768063 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +1845,5 @@
>      const nsDependentCString expiration(h.expiration());
>      gKeyedHistograms.Put(id, new KeyedHistogram(id, expiration, h.histogramType,
>                                                  h.min, h.max, h.bucketCount, h.dataset));
> +    if (XRE_IsParentProcess()) {
> +      // must create registered child keyed histograms as well

Nit: Casing ("Must") and trailing ".".
Also comment on why this is needed, so we can potentially clean this up in the future.
Attachment #8768063 - Flags: review?(gfritzsche) → review+
(In reply to Georg Fritzsche [:gfritzsche] from comment #50)
> Comment on attachment 8768065 [details] [diff] [review]
> 0005-bug-1218576-Use-child-aggregation-to-populate-childP.patch
> 
> Review of attachment 8768065 [details] [diff] [review]:
> -----------------------------------------------------------------
> @@ +1319,3 @@
> >      }
> > +    // Add the latest child subsession info. It only has hgrams
> > +    payloadObj.childPayloads.push({
> 
> To keep the current behavior working, isn't it enough to just always add one
> childPayload with the current non-subsession histograms/keyedHistograms?

Wouldn't we lose simpleMeasurements (things like uptime/totaltime necessary for determining rates per usage hour, etc.) that way?
Comment on attachment 8768066 [details] [diff] [review]
0006-bug-1218576-Support-remote-accumulation-via-JS-histo.patch

Review of attachment 8768066 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM with the below fixed.

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +815,4 @@
>    nsresult Add(const nsCString& key, uint32_t aSample);
>    void Clear(bool subsession);
>  
> +  nsresult EnumId(mozilla::Telemetry::ID& id);

`GetEnumId()`

@@ +1100,4 @@
>  
>  namespace {
>  
> +bool internal_RemoteAccumulate(mozilla::Telemetry::ID, uint32_t);

Let's not mix this in here.
If we need to pre-declare, let's do so in a separate section.
In this case though, we can just move the whole `JSHistogram_*` and `JSKeyedHistogram_*` sections to after the section defining `internal_RemoteAccumulate()`.

@@ +1343,4 @@
>    }
>  }
>  
> +bool internal_RemoteAccumulate(mozilla::Telemetry::ID, const nsCString&, uint32_t);

Ditto.
Attachment #8768066 - Flags: review?(gfritzsche) → review+
(In reply to Chris H-C :chutten from comment #52)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #50)
> > Comment on attachment 8768065 [details] [diff] [review]
> > 0005-bug-1218576-Use-child-aggregation-to-populate-childP.patch
> > 
> > Review of attachment 8768065 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > @@ +1319,3 @@
> > >      }
> > > +    // Add the latest child subsession info. It only has hgrams
> > > +    payloadObj.childPayloads.push({
> > 
> > To keep the current behavior working, isn't it enough to just always add one
> > childPayload with the current non-subsession histograms/keyedHistograms?
> 
> Wouldn't we lose simpleMeasurements (things like uptime/totaltime necessary
> for determining rates per usage hour, etc.) that way?

I see, that is a bit premature then.
Still, I'm not sure what the subsession histogram handling is about.
As far as I know, all that we need to keep working right now (for e10s analysis and the aggregator) is the `childPayload` behavior in "saved-session" for whole-session histograms?
Comment on attachment 8768068 [details] [diff] [review]
0007-bug-1218576-Update-test_ChildHistograms-for-child-te.patch

Review of attachment 8768068 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/tests/unit/test_ChildHistograms.js
@@ +68,4 @@
>    if (runningInParent) {
>      // Make sure we don't generate unexpected pings due to pref changes.
>      setEmptyPrefWatchlist();
> +    // BrowserTestUtils not available on child

Nit: Slightly more complete sentence please?

@@ +68,5 @@
>    if (runningInParent) {
>      // Make sure we don't generate unexpected pings due to pref changes.
>      setEmptyPrefWatchlist();
> +    // BrowserTestUtils not available on child
> +    Cu.import("resource://testing-common/BrowserTestUtils.jsm");

Hm, that is from `browser/`, which we shouldn't depend on from `toolkit/`.
I see e.g. that `ContentTaskUtils.jsm` has it in a more generic location:
https://dxr.mozilla.org/mozilla-central/search?q=waitForCondition+-path%3Abrowser&redirect=false

@@ +81,5 @@
> +    return payload &&
> +           "childPayloads" in payload &&
> +           "histograms" in payload.childPayloads[0] &&
> +           "TELEMETRY_TEST_FLAG" in payload.childPayloads[0].histograms;
> +  }).then(() => {

yield P.then()?
Why not:
> yield waitForCondition(...);
>
> const payload = ...
Attachment #8768068 - Flags: review?(gfritzsche)
Comment on attachment 8768069 [details] [diff] [review]
0008-bug-1218576-Update-test-browser_use_counters.js-to-u.patch

Review of attachment 8768069 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/base/test/browser_use_counters.js
@@ +104,5 @@
> +  let gather = () => [
> +    telemetry.getHistogramById("USE_COUNTER2_" + use_counter_middlefix + "_PAGE").snapshot().sum,
> +    telemetry.getHistogramById("USE_COUNTER2_" + use_counter_middlefix + "_DOCUMENT").snapshot().sum,
> +    telemetry.getHistogramById("CONTENT_DOCUMENTS_DESTROYED").snapshot().sum,
> +    telemetry.getHistogramById("TOP_LEVEL_CONTENT_DOCUMENTS_DESTROYED").snapshot().sum,

This actually seems to look at parent histograms, not child histograms?
Does that mean we are accumulating into the parent histogram?
Or is getHistogramById() returning a child histogram?
Attachment #8768069 - Flags: review?(gfritzsche)
Comment on attachment 8768072 [details] [diff] [review]
0009-bug-1218576-Assert-we-re-the-parent-process-when-cle.patch

Review of attachment 8768072 [details] [diff] [review]:
-----------------------------------------------------------------

Good riddance... with the below fixed.

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +620,4 @@
>  void
>  internal_HistogramClear(Histogram& aHistogram, bool onlySubsession)
>  {
> +  MOZ_ASSERT(XRE_IsParentProcess());

Let's also early-return on `!XRE_IsParentProcess()`.

@@ +960,4 @@
>  void
>  KeyedHistogram::Clear(bool onlySubsession)
>  {
> +  MOZ_ASSERT(XRE_IsParentProcess());

Ditto.
Attachment #8768072 - Flags: review?(gfritzsche) → review+
Comment on attachment 8768065 [details] [diff] [review]
0005-bug-1218576-Use-child-aggregation-to-populate-childP.patch

Review of attachment 8768065 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +1568,5 @@
>  
> +      // We use and clear the subsession hgrams here to emulate the current
> +      // saved-session-like behaviour for individual child payloads.
> +      message.data.histograms = this.getHistograms(true, true, CHILD_PROCESS_TYPES.CONTENT);
> +      message.data.keyedHistograms = this.getKeyedHistograms(true, true, CHILD_PROCESS_TYPES.CONTENT);

Why do these pass `true, true`?
This seems unneeded for "saved-session" like semantics - we probably want the non-subsession histograms and to not clear subsession data.
Depends on: 1285006
Comment on attachment 8768065 [details] [diff] [review]
0005-bug-1218576-Use-child-aggregation-to-populate-childP.patch

Review of attachment 8768065 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +1319,3 @@
>      }
> +    // Add the latest child subsession info. It only has hgrams
> +    payloadObj.childPayloads.push({

Let's drop this - we don't need the remaining values to keep existing behavior working.

@@ +1568,5 @@
>  
> +      // We use and clear the subsession hgrams here to emulate the current
> +      // saved-session-like behaviour for individual child payloads.
> +      message.data.histograms = this.getHistograms(true, true, CHILD_PROCESS_TYPES.CONTENT);
> +      message.data.keyedHistograms = this.getKeyedHistograms(true, true, CHILD_PROCESS_TYPES.CONTENT);

We talked about it and cleared up my understanding, this seems fine to me now.
Blocks: 1285568
Comment on attachment 8768065 [details] [diff] [review]
0005-bug-1218576-Use-child-aggregation-to-populate-childP.patch

Review of attachment 8768065 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +41,5 @@
>  const REASON_ENVIRONMENT_CHANGE = "environment-change";
>  const REASON_SHUTDOWN = "shutdown";
>  
> +const CHILD_PROCESS_TYPES = {
> +  CONTENT: "#child",

I don't think we should leak this out of the Telemetry core.
Let's file a follow-up on properly rewriting the snapshot parts of nsITelemetry.
This would allow snapshotting histograms/keyedHistograms for subsession/session, opt-in/opt-out, process.
Attachment #8768060 - Attachment is obsolete: true
Attachment #8770187 - Flags: review+
Attachment #8768061 - Attachment is obsolete: true
Attachment #8770189 - Flags: review?(gfritzsche)
Attachment #8768063 - Attachment is obsolete: true
Attachment #8770190 - Flags: review+
Attachment #8768064 - Attachment is obsolete: true
Attachment #8770191 - Flags: review+
Attachment #8768065 - Attachment is obsolete: true
Attachment #8770192 - Flags: review?(gfritzsche)
Attachment #8768066 - Attachment is obsolete: true
Attachment #8770193 - Flags: review+
Attachment #8768068 - Attachment is obsolete: true
Attachment #8770194 - Flags: review?(gfritzsche)
Attachment #8768069 - Attachment is obsolete: true
Attachment #8770195 - Flags: review?(gfritzsche)
Attachment #8768072 - Attachment is obsolete: true
Attachment #8770196 - Flags: review+
As identified in our conversations, one more patch is needed to bring this back to parity: we could've lost at most 2s of accumulations during content process shutdown.

This patch takes care of that case.
Attachment #8770197 - Flags: review?(gfritzsche)
Comment on attachment 8770189 [details] [diff] [review]
0002-bug-1218576-Support-subsession-hgrams-for-child-aggr.patch

Review of attachment 8770189 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +537,5 @@
>  
>  #if !defined(MOZ_WIDGET_GONK) && !defined(MOZ_WIDGET_ANDROID)
> +
> +bool
> +StringEndsWith(const std::string& name, const std::string& suffix)

Let's move this to the "// PRIVATE: Misc small helpers" section.
Attachment #8770189 - Flags: review?(gfritzsche) → review+
Comment on attachment 8770192 [details] [diff] [review]
0005-bug-1218576-Use-child-aggregation-to-populate-childP.patch

Review of attachment 8770192 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +879,4 @@
>                                         : Ci.nsITelemetry.DATASET_RELEASE_CHANNEL_OPTOUT;
>    },
>  
> +  getHistograms: function getHistograms(subsession, clearSubsession, childProcessSuffix) {

`..., childProcessSuffix = null)`

@@ +930,4 @@
>      return ret;
>    },
>  
> +  getKeyedHistograms: function(subsession, clearSubsession, childProcessSuffix) {

`..., childProcessSuffix = null)`

@@ +946,5 @@
>          continue;
>        }
> +      let keyed = Telemetry.getKeyedHistogramById(childProcessSuffix ?
> +                                                  id + childProcessSuffix :
> +                                                  id);

We could also .map() this outside of the loop.

@@ +1560,5 @@
>        let source = message.data.childUUID;
>        delete message.data.childUUID;
>  
> +      // We use and clear the subsession hgrams here to emulate the current
> +      // saved-session-like behaviour for individual child payloads.

Expand a bit here on why this works for now.
Something like:
"We only expect one child process for most users at this point, so snapshotting and resetting the child subsession histograms when a child process shuts down gives us their histogram accumulations over its process lifetime."

Also nit: hgrams -> histograms.
Attachment #8770192 - Flags: review?(gfritzsche) → review+
Comment on attachment 8770192 [details] [diff] [review]
0005-bug-1218576-Use-child-aggregation-to-populate-childP.patch

Review of attachment 8770192 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +1562,5 @@
>  
> +      // We use and clear the subsession hgrams here to emulate the current
> +      // saved-session-like behaviour for individual child payloads.
> +      message.data.histograms = this.getHistograms(true, true, CHILD_PROCESS_HISTOGRAM_SUFFIXES.CONTENT);
> +      message.data.keyedHistograms = this.getKeyedHistograms(true, true, CHILD_PROCESS_HISTOGRAM_SUFFIXES.CONTENT);

Did you make sure we are not also resetting the "child subsession histograms" on subsession splits?
Comment on attachment 8770194 [details] [diff] [review]
0007-bug-1218576-Update-test_ChildHistograms-for-child-te.patch

Review of attachment 8770194 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +1317,4 @@
>        payloadObj.childPayloads = [];
>      }
>  
> +    if (this._testing) {

This means we are putting this on every "main" or "saved-session" ping in xpcshell-tests (and potentially other tests).
If you mean to specifically put this on the ping that test_ChildHistograms.js requests, you should check for `reason == "test-ping"`.

But really I think we can just wait for the measurements to show up in the test ping from the test code, without adding anything explicitly here.
E.g. something like:
* get test-ping payload
* oldLength = payload.childPayloads.length
* run child test
* waitForCondition() on `childPayloads.length > oldLength`
Attachment #8770194 - Flags: review?(gfritzsche)
Comment on attachment 8770195 [details] [diff] [review]
0008-bug-1218576-Update-test-browser_use_counters.js-to-u.patch

Review of attachment 8770195 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/base/test/browser_use_counters.js
@@ +99,4 @@
>    });
>  }
>  
> +function grabHistogramsFromContent(use_counter_middlefix, page_before) {

Make it clear that the second argument is optional, e.g. `page_before=null`.
Attachment #8770195 - Flags: review?(gfritzsche) → review+
Comment on attachment 8770197 [details] [diff] [review]
0010-bug-1218576-Ensure-remaining-batched-telemetry-is-fl.patch

Review of attachment 8770197 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Telemetry.cpp
@@ +2370,5 @@
> +NS_IMETHODIMP
> +TelemetryImpl::FlushBatchedTelemetry()
> +{
> +  TelemetryHistogram::CancelIPCTimer();
> +  TelemetryHistogram::IPCTimerFired(nullptr, nullptr);

Let's just make this into one call `TelemetryHistogram::FlushBatchedChildTelemetry()` or so?
I don't think we need to expose `CancelIPCTimer()`.

::: toolkit/components/telemetry/nsITelemetry.idl
@@ +431,5 @@
> +  /**
> +   * Immediately sends any Telemetry batched on this process. This is
> +   * intended only to be used on process shutdown.
> +   */
> +  void flushBatchedTelemetry();

`flushBatchedChildTelemetry()`?
Attachment #8770197 - Flags: review?(gfritzsche) → feedback+
(In reply to Georg Fritzsche [:gfritzsche] from comment #76)
> Comment on attachment 8770197 [details] [diff] [review]
> 0010-bug-1218576-Ensure-remaining-batched-telemetry-is-fl.patch
> 
> Review of attachment 8770197 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: toolkit/components/telemetry/Telemetry.cpp
> @@ +2370,5 @@
> > +NS_IMETHODIMP
> > +TelemetryImpl::FlushBatchedTelemetry()
> > +{
> > +  TelemetryHistogram::CancelIPCTimer();
> > +  TelemetryHistogram::IPCTimerFired(nullptr, nullptr);
> 
> Let's just make this into one call
> `TelemetryHistogram::FlushBatchedChildTelemetry()` or so?
> I don't think we need to expose `CancelIPCTimer()`.

Right now only IPCTimerFired has locking semantics that differ from the rest of TelemetryHistogram::*. If I introduce a TelemetryHistogram::FlushBatchedChildTelemetry() it'll need to be of a form that takes the lock for cancelling the IPC timer, but releases it to call IPCTimerFired, which would make it a second TH::* that has odd locking semantics. Would that be acceptable?
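
For concreteness, a sketch of the shape that would take (the internal
helper name is assumed, not final):

void
TelemetryHistogram::FlushBatchedChildTelemetry()
{
  {
    StaticMutexAutoLock locker(gTelemetryHistogramMutex);
    internal_cancelIPCTimer(); // hypothetical helper; cancel under the lock
  }
  // IPCTimerFired takes the lock itself, so it must be called unlocked.
  IPCTimerFired(nullptr, nullptr);
}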

(In reply to Georg Fritzsche [:gfritzsche] from comment #73)
> Comment on attachment 8770192 [details] [diff] [review]
> 0005-bug-1218576-Use-child-aggregation-to-populate-childP.patch
> 
> Review of attachment 8770192 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: toolkit/components/telemetry/TelemetrySession.jsm
> @@ +1562,5 @@
> >  
> > +      // We use and clear the subsession hgrams here to emulate the current
> > +      // saved-session-like behaviour for individual child payloads.
> > +      message.data.histograms = this.getHistograms(true, true, CHILD_PROCESS_HISTOGRAM_SUFFIXES.CONTENT);
> > +      message.data.keyedHistograms = this.getKeyedHistograms(true, true, CHILD_PROCESS_HISTOGRAM_SUFFIXES.CONTENT);
> 
> Did you make sure we are not also resetting the "child subsession
> histograms" on subsession splits?

...uh oh. And, even worse, getHistograms(x, true, x) clears the parent subsession histograms: every time a child process shuts down, this code clears the parent subsession without using the data.

*sigh* I'm really starting to dislike subsessions.

From involved discussions on IRC, consensus was formed around the following: since the whole point is aggregating the hgrams, just put the aggregated child hgrams in the root payload a la bug 1281795, changing moztelemetry and mozaggregator. childPayloads will still exist to provide simpleMeasurements, threadHangs, and the other stuff that currently lives there... but it will no longer contain histograms.

This _should_ be fine as nothing I've found in a casual scan depends on cross-referencing (within a given childPayload) anything in histograms/keyedHistograms with anything else. But we'll see.

(In reply to Georg Fritzsche [:gfritzsche] from comment #72)
> Comment on attachment 8770192 [details] [diff] [review]
> 0005-bug-1218576-Use-child-aggregation-to-populate-childP.patch
> 
> Review of attachment 8770192 [details] [diff] [review]:
> -----------------------------------------------------------------
> @@ +946,5 @@
> >          continue;
> >        }
> > +      let keyed = Telemetry.getKeyedHistogramById(childProcessSuffix ?
> > +                                                  id + childProcessSuffix :
> > +                                                  id);
> 
> We could also .map() this outside of the loop.

Unfortunately not, as we'd have to unmap it to remove the `childProcessSuffix` from `id` before putting it in `ret`. Either we have a conditional on `childProcessSuffix` up here, or one down at `ret[id] = {};`
Attachment #8770192 - Flags: review+ → review-
(In reply to Chris H-C :chutten from comment #77)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #76)
> > Comment on attachment 8770197 [details] [diff] [review]
> > 0010-bug-1218576-Ensure-remaining-batched-telemetry-is-fl.patch
> > 
> > Review of attachment 8770197 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > ::: toolkit/components/telemetry/Telemetry.cpp
> > @@ +2370,5 @@
> > > +NS_IMETHODIMP
> > > +TelemetryImpl::FlushBatchedTelemetry()
> > > +{
> > > +  TelemetryHistogram::CancelIPCTimer();
> > > +  TelemetryHistogram::IPCTimerFired(nullptr, nullptr);
> > 
> > Let's just make this into one call
> > `TelemetryHistogram::FlushBatchedChildTelemetry()` or so?
> > I don't think we need to expose `CancelIPCTimer()`.
> 
> Right now only IPCTimerFired has locking semantics that differ from the rest
> of TelemetryHistogram::*. If I introduce a
> TelemetryHistogram::FlushBatchedChildTelemetry() it'll need to be of a form
> that takes the lock for cancelling the IPC timer, but releases it to call
> IPCTimerFired, which would make it a second TH::* that has odd locking
> semantics. Would that be acceptable?

Yes.

> From involved discussions on IRC, consensus was formed around the following:
> since the whole point is aggregating the hgrams, just put the aggregated
> child hgrams in the root payload a la bug 1281795, changing moztelemetry and
> mozaggregator. childPayloads will still exist to provide simpleMeasurements,
> threadHangs, and the other stuff that currently lives there... but it will
> no longer contain histograms.
> 
> This _should_ be fine as nothing I've found in a casual scan depends on
> cross-referencing (within a given childPayload) anything in
> histograms/keyedHistograms with anything else. But we'll see.

I sent out a mail to fhr-dev about this; let's see if that brings up any other dependencies:
https://mail.mozilla.org/pipermail/fhr-dev/2016-July/000978.html

> (In reply to Georg Fritzsche [:gfritzsche] from comment #72)
> > Review of attachment 8770192 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > @@ +946,5 @@
> > >          continue;
> > >        }
> > > +      let keyed = Telemetry.getKeyedHistogramById(childProcessSuffix ?
> > > +                                                  id + childProcessSuffix :
> > > +                                                  id);
> > 
> > We could also .map() this outside of the loop.
> 
> Unfortunately not, as we'd have to unmap it to remove the
> `childProcessSuffix` from `id` before putting it in `ret`. Either we have a
> conditional on `childProcessSuffix` up here, or one down at `ret[id] = {};`

Never mind then.
Just changing #child -> #content
Attachment #8770187 - Attachment is obsolete: true
Attachment #8771095 - Flags: review+
reworded commit message to drop reference to childPayloads.
Attachment #8770189 - Attachment is obsolete: true
Attachment #8771096 - Flags: review+
Alrighty, here's the rework of TelemetrySession to place content process telemetry nearer the root of ping payloads.
Attachment #8770192 - Attachment is obsolete: true
Attachment #8771097 - Flags: review?(gfritzsche)
With processes.content.histograms now present on all test payloads as the telemetry comes in (how novel!), this test gets a bit more readable (and no longer needs any cheaty code in TS.jsm).
Attachment #8770194 - Attachment is obsolete: true
Attachment #8771098 - Flags: review?(gfritzsche)
#child -> #content
Attachment #8770195 - Attachment is obsolete: true
Attachment #8771099 - Flags: review+
This one may require a little explanation (also included in the commit message).

Basically, FlushBatchedChildTelemetry now just calls IPCTimerFired(nullptr, nullptr) as I've discovered that IPCTimerFired is threadsafe. I mean, there's a race as to which thread (the timer's or any of the ones that could theoretically be simultaneously calling FlushBatchedChildTelemetry) will be the one responsible for sending those batched accumulations (assuming there are any)... but that luckily doesn't matter, and the state remains consistent throughout.
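
To illustrate why the race is benign, here's the pattern at the heart of
IPCTimerFired (paraphrased sketch; array and type names assumed):

nsTArray<Accumulation> accumulationsToSend;
{
  StaticMutexAutoLock locker(gTelemetryHistogramMutex);
  // Steal the current batch under the lock.
  accumulationsToSend.SwapElements(gHistogramAccumulations); // assumed name
}
// A racing caller swaps out an empty array and sends nothing, so the batch
// is sent exactly once and the state stays consistent either way.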
Attachment #8770197 - Attachment is obsolete: true
Attachment #8771103 - Flags: review?(gfritzsche)
Depends on: 1286951
Comment on attachment 8771097 [details] [diff] [review]
0005-bug-1218576-Move-aggregated-child-telemetry-to-proce.patch

Review of attachment 8771097 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +1268,5 @@
>  
> +    if (!Utils.isContentProcess) {
> +      let histograms = protect(() => this.getHistograms(isSubsession, clearSubsession));
> +      let keyedHistograms = protect(() => this.getKeyedHistograms(isSubsession, clearSubsession));
> +      payloadObj.histograms = histograms[HISTOGRAM_SUFFIXES.PARENT];

When the call wrapped in `protect()` throws, `protect()` will return `null`.
Indexing `null` will then throw.

We can:
(1) change it to `protect(fn, default=null)` and call `protect(..., {})`
(2) or do explicit `null` checks here

@@ +1271,5 @@
> +      let keyedHistograms = protect(() => this.getKeyedHistograms(isSubsession, clearSubsession));
> +      payloadObj.histograms = histograms[HISTOGRAM_SUFFIXES.PARENT];
> +      payloadObj.keyedHistograms = keyedHistograms[HISTOGRAM_SUFFIXES.PARENT];
> +      payloadObj.processes = {
> +        content: {

When updating to the latest m-c, `payloadObj.processes.parent.scalars` already exists.
Attachment #8771097 - Flags: review?(gfritzsche) → feedback+
Comment on attachment 8771098 [details] [diff] [review]
0007-bug-1218576-Update-test_ChildHistograms-for-child-te.patch

Review of attachment 8771098 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/tests/unit/test_ChildHistograms.js
@@ +78,5 @@
> +  yield ContentTaskUtils.waitForCondition(() => {
> +    let payload = TelemetrySession.getPayload("test-ping");
> +    return payload &&
> +           "processes" in payload &&
> +           "histograms" in payload.processes.content &&

You should check for `"content" in payload.processes` too.

@@ +85,3 @@
>    const payload = TelemetrySession.getPayload("test-ping");
> +  Assert.ok("processes" in payload && "content" in payload.processes,
> +            "Should have content child payload.");

Let's make those two separate asserts.
Attachment #8771098 - Flags: review?(gfritzsche) → review+
Comment on attachment 8771103 [details] [diff] [review]
0010-bug-1218576-Ensure-remaining-batched-telemetry-is-fl.patch

Review of attachment 8771103 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/nsITelemetry.idl
@@ +428,5 @@
>     */
>    void clearScalars();
> +
> +  /**
> +   * Immediately sends any Telemetry batched on this process. This is

"... on this process to the parent process."
Attachment #8771103 - Flags: review?(gfritzsche) → review+
Attachment #8771097 - Attachment is obsolete: true
Attachment #8772410 - Flags: review?(gfritzsche)
Attachment #8771098 - Attachment is obsolete: true
Attachment #8772411 - Flags: review+
Attachment #8771103 - Attachment is obsolete: true
Attachment #8772412 - Flags: review+
Attachment #8772410 - Flags: review?(gfritzsche) → review+
One last thing: We need to update the documentation for the payload format change (main-ping.rst).
Depends on: 1287101
try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=bc61e32ed508&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-classifiedState=unclassified&selectedJob=24224833

So, it's not going too swimmingly. There are other tests that are failing, and at least one of them is reproducible locally.

Apparently NewHistogram()-made histograms do not get added to the gHistogramsMap. This is a problem as JSHistogram_Add (and friends) look for the histogram there. :(

DXR [1] says only two tests use it (the two tests that are failing for me; the third failure is a timeout). So why do we support it? I thought the whole reason we had special handling for TEST_* histograms was to support tests.

[1]: https://dxr.mozilla.org/mozilla-central/search?q=ewHistogram&case=true&=mozilla-central
Fixing for test failure.
Attachment #8771095 - Attachment is obsolete: true
Attachment #8773780 - Flags: review+
Small testfix
Attachment #8772410 - Attachment is obsolete: true
Attachment #8773781 - Flags: review+
testfix
Attachment #8770193 - Attachment is obsolete: true
Attachment #8773782 - Flags: review+
testfix - need to run browser_use_counters in non-e10s mode as well.
Attachment #8771099 - Attachment is obsolete: true
Attachment #8775162 - Flags: review+
testfix
Attachment #8772412 - Attachment is obsolete: true
Attachment #8775163 - Flags: review+
With that, try seems pleased [1]. Now to get the dependencies taken care of.

...and I should give some thought to how to verify that this is working once it's in. I guess a simple notebook checking for child data from nightlies with the appropriate buildid would do, since it's an all-or-nothing kind of thing: either this works, moztelemetry works, and we can see all of the child data streaming in immediately, or something exploded and we see absolutely no child data streaming in for that build.

No pressure.

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a34e5eeb1b50&selectedJob=24535443
We should also monitor the evolution of a few child data points (simple measurements & content process histograms) to verify they are not regressing.

Additionally, I can check with John Dorlus to get a QA pass on this.

Also, is comment 91 better solved in a new bug or here?
Is about:telemetry still working when this lands?
about:telemetry is not guaranteed to work with just this. Well, the raw view works. That's what bug 1277504 is for.

comment 91 would likely get swamped here and would be better as its own bug.
No longer blocks: 1277504
Depends on: 1277504
No longer blocks: 1290027
Depends on: 1290027
In the previous try run there was an odd crash in internal_GetEnumById [1]. I couldn't reproduce it locally. From code inspection, the only likely possibility was that AccumulateChild was being called with an invalid telemetry ID.

So here's an extra patch ensuring that the IDs coming across IPC are valid. Try [2] seems to be assuaged.

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a34e5eeb1b50&selectedJob=24609454
[2]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=09c731eeb58c
Attachment #8776914 - Flags: review?(gfritzsche)
Comment on attachment 8776914 [details] [diff] [review]
0011-bug-1218576-Ensure-Telemetry-IDs-coming-from-IPC-are.patch

Review of attachment 8776914 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good to me with the below fixed.

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +2069,5 @@
>      return;
>    }
>    for (uint32_t i = 0; i < aAccumulations.Length(); ++i) {
> +    MOZ_ASSERT(internal_IsHistogramEnumId(aAccumulations[i].mId));
> +    if (!internal_IsHistogramEnumId(aAccumulations[i].mId)) {

Let's not call the function twice and add a message:
const bool isValid = internal_IsHistogramEnumId(...);
MOZ_ASSERT(isValid, "Received invalid enum id.");
if (!isValid) ...
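
In the loop context, that would read (sketch; assuming the branch skips the
invalid entry, as the existing code suggests):

for (uint32_t i = 0; i < aAccumulations.Length(); ++i) {
  const bool isValid = internal_IsHistogramEnumId(aAccumulations[i].mId);
  MOZ_ASSERT(isValid, "Received invalid enum id.");
  if (!isValid) {
    continue;
  }
  // ... accumulate as before ...
}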

@@ +2087,5 @@
>      return;
>    }
>    for (uint32_t i = 0; i < aAccumulations.Length(); ++i) {
> +    MOZ_ASSERT(internal_IsHistogramEnumId(aAccumulations[i].mId));
> +    if (!internal_IsHistogramEnumId(aAccumulations[i].mId)) {

Ditto.
Attachment #8776914 - Flags: review?(gfritzsche) → review+
Attachment #8776914 - Attachment is obsolete: true
Attachment #8777042 - Flags: review+
Attachment #8773780 - Attachment is obsolete: true
Attachment #8781667 - Flags: review+
Attachment #8771096 - Attachment is obsolete: true
Attachment #8781668 - Flags: review+
This one changed sufficiently in rebase to warrant a second look.
Attachment #8781673 - Flags: review?(gfritzsche)
Attachment #8773782 - Attachment is obsolete: true
Attachment #8775163 - Attachment is obsolete: true
Attachment #8781679 - Flags: review+
Attachment #8781682 - Flags: review?(gfritzsche)
Comment on attachment 8781681 [details] [diff] [review]
0012-bug-1218576-Ensure-we-can-record-before-trying-to-re.patch

Review of attachment 8781681 [details] [diff] [review]:
-----------------------------------------------------------------

Ah, good catch.
Attachment #8781681 - Flags: review?(gfritzsche) → review+
Comment on attachment 8781667 [details] [diff] [review]
0001-bug-1218576-Accumulate-child-histograms-in-the-paren.patch

Review of attachment 8781667 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +1781,4 @@
>  internal_Accumulate(mozilla::Telemetry::ID aID,
>                      const nsCString& aKey, uint32_t aSample)
>  {
> +  if (!gInitDone || !internal_CanRecordBase() ||

Why does this check for gInitDone, while the non-keyed version above does not?
Comment on attachment 8781673 [details] [diff] [review]
0006-bug-1218576-Support-remote-accumulation-via-JS-histo.patch

Review of attachment 8781673 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +1144,4 @@
>  ////////////////////////////////////////////////////////////////////////
>  ////////////////////////////////////////////////////////////////////////
>  //
> +// PRIVATE: functions related to addon histograms

Review note: This section just got moved, no changes.

@@ +1245,5 @@
> +
> +////////////////////////////////////////////////////////////////////////
> +////////////////////////////////////////////////////////////////////////
> +//
> +// PRIVATE: thread-unsafe helpers for the external interface

Review note: This section just got moved, no changes.

@@ +1472,5 @@
> +    nsresult rv = internal_GetHistogramEnumId(h->histogram_name().c_str(), &id);
> +    if (!NS_SUCCEEDED(rv)) {
> +      return false;
> +    }
> +    return internal_RemoteAccumulate(id, value);

Can't we overload `internal_Accumulate()` so we can directly call it from here?
That way we can keep the parent/content branching logic contained in one place.
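
E.g. a sketch of the overload (this assumes `internal_RemoteAccumulate()`
returns false in the parent process, so the parent falls through to a
direct add):

void
internal_Accumulate(Histogram& aHistogram, uint32_t aSample)
{
  mozilla::Telemetry::ID id;
  nsresult rv =
    internal_GetHistogramEnumId(aHistogram.histogram_name().c_str(), &id);
  if (NS_FAILED(rv) || !internal_RemoteAccumulate(id, aSample)) {
    aHistogram.Add(aSample);
  }
}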

@@ +1726,5 @@
>  
> +  NS_ConvertUTF16toUTF8 utf8Key(key);
> +  mozilla::Telemetry::ID id;
> +  if (NS_FAILED(keyed->GetEnumId(id)) ||
> +      !internal_RemoteAccumulate(id, utf8Key, value)) {

As above, it would be clearer:
if (!XRE_IsParentProcess()) ...

@@ -1621,4 @@
>  ////////////////////////////////////////////////////////////////////////
>  ////////////////////////////////////////////////////////////////////////
>  //
> -// PRIVATE: functions related to addon histograms

Review note: This section just got moved, no changes.

@@ -1722,5 @@
> -
> -////////////////////////////////////////////////////////////////////////
> -////////////////////////////////////////////////////////////////////////
> -//
> -// PRIVATE: thread-unsafe helpers for the external interface

Review note: This section just got moved, no changes.
Attachment #8781673 - Flags: review?(gfritzsche) → feedback+
Comment on attachment 8781682 [details] [diff] [review]
0013-bug-1218576-Aggregate-child-categorical-histograms-i.patch

Review of attachment 8781682 [details] [diff] [review]:
-----------------------------------------------------------------

With the different code paths we have, I'd like to see test_ChildHistograms.js updated with coverage for at least categorical child histograms too.
Let's file a follow-up for more complete coverage in test_ChildHistograms.js for all the types, to avoid breaking things later.

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +1434,4 @@
>        return false;
>      }
>  
> +    if (NS_FAILED(gHistograms[id].label_id(NS_ConvertUTF16toUTF8(label).get(), &value))) {

Readability: this is getting hard to read. Let's make this `if (NS_FAILED(rv)) ...` and move the rest to the preceding line.
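
I.e. (sketch):

nsresult rv =
  gHistograms[id].label_id(NS_ConvertUTF16toUTF8(label).get(), &value);
if (NS_FAILED(rv)) {
  return false;
}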
Attachment #8781682 - Flags: review?(gfritzsche) → feedback+
Blocks: 1296283
(In reply to Georg Fritzsche [:gfritzsche] from comment #120)
> Comment on attachment 8781667 [details] [diff] [review]
> 0001-bug-1218576-Accumulate-child-histograms-in-the-paren.patch
> 
> Review of attachment 8781667 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: toolkit/components/telemetry/TelemetryHistogram.cpp
> @@ +1781,4 @@
> >  internal_Accumulate(mozilla::Telemetry::ID aID,
> >                      const nsCString& aKey, uint32_t aSample)
> >  {
> > +  if (!gInitDone || !internal_CanRecordBase() ||
> 
> Why does this check for gInitDone, while the non-keyed version above does
> not?

No idea. I had a similar question when I saw it, but decided it was best to leave well enough alone :S

Filed Bug 1296283 for the test (except for the categorical, which I'll add as part of this bug)
I left JSKeyedHistogram_Add alone, as keyed histograms are their own special-case animal. Trying to add an internal_Accumulate specifically for a keyed histogram in this way resulted in a lot more confusion than the code as written, so I took it out.
Attachment #8781673 - Attachment is obsolete: true
Attachment #8782636 - Flags: review?(gfritzsche)
Adding categorical histogram tests to test_ChildHistograms.js will be a fourteenth (!!) patch, tomorrow.
Attachment #8781682 - Attachment is obsolete: true
Attachment #8782637 - Flags: review?(gfritzsche)
Comment on attachment 8782636 [details] [diff] [review]
0006-bug-1218576-Support-remote-accumulation-via-JS-histo.patch

Review of attachment 8782636 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +1352,5 @@
> +
> +void
> +internal_Accumulate(Histogram& aHistogram, uint32_t aSample)
> +{
> +  if (!XRE_IsParentProcess()) {

For readability I'd prefer:
if (XRE_IsParentProcess()) {
  ...
  return;
}

...

@@ +1735,5 @@
> +  mozilla::Telemetry::ID id;
> +  if (NS_FAILED(keyed->GetEnumId(id)) ||
> +      !internal_RemoteAccumulate(id, utf8Key, value)) {
> +    keyed->Add(utf8Key, value);
> +  }

I think this could be a simple `internal_Accumulate(keyed, utf8Key, value)` helper.
The background here is that we'd then have the `internal_RemoteAccumulate()` branching logic contained in four very similar functions.
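
Sketch (just the existing logic, contained in one place):

void
internal_Accumulate(KeyedHistogram& aKeyed, const nsCString& aKey,
                    uint32_t aSample)
{
  mozilla::Telemetry::ID id;
  if (NS_FAILED(aKeyed.GetEnumId(id)) ||
      !internal_RemoteAccumulate(id, aKey, aSample)) {
    aKeyed.Add(aKey, aSample);
  }
}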
Attachment #8782636 - Flags: review?(gfritzsche) → feedback+
Attachment #8782637 - Flags: review?(gfritzsche) → review+
Understood. Changes made.
Attachment #8782636 - Attachment is obsolete: true
Attachment #8782939 - Flags: review?(gfritzsche)
Attachment #8782943 - Flags: review?(gfritzsche)
Comment on attachment 8782939 [details] [diff] [review]
0006-bug-1218576-Support-remote-accumulation-via-JS-histo.patch

Review of attachment 8782939 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!
Attachment #8782939 - Flags: review?(gfritzsche) → review+
Comment on attachment 8782943 [details] [diff] [review]
0014-bug-1218576-Test-child-categorical-histogram-aggrega.patch

Review of attachment 8782943 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!
Attachment #8782943 - Flags: review?(gfritzsche) → review+
Attachment #8781668 - Attachment is obsolete: true
Attachment #8783654 - Flags: review+
Attachment #8783654 - Attachment is obsolete: true
Attachment #8783655 - Flags: review+
This currently has linting errors (running `mach eslint toolkit/components/telemetry/`):
.../toolkit/components/telemetry/tests/unit/test_ChildHistograms.js
  94:43  error  A space is required after ','  comma-spacing
Comment on attachment 8781667 [details] [diff] [review]
0001-bug-1218576-Accumulate-child-histograms-in-the-paren.patch

Review of attachment 8781667 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +2559,5 @@
> +    }
> +  }
> +
> +  mozilla::dom::ContentChild* contentChild = mozilla::dom::ContentChild::GetSingleton();
> +  if (!NS_WARN_IF(!contentChild)) {

Nit: having had to look at this file, this reads a bit confusingly.
Can we put the NS_WARN_IF() on a separate line before the if?

@@ +2562,5 @@
> +  mozilla::dom::ContentChild* contentChild = mozilla::dom::ContentChild::GetSingleton();
> +  if (!NS_WARN_IF(!contentChild)) {
> +    if (accumulationsToSend.Length()) {
> +      mozilla::Unused <<
> +        NS_WARN_IF(contentChild->SendAccumulateChildHistogram(accumulationsToSend));

This seems to print warnings for me on each timer activation.
This should be `NS_WARN_IF(!...)`.
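
I.e. (sketch of the fix):

mozilla::Unused <<
  NS_WARN_IF(!contentChild->SendAccumulateChildHistogram(accumulationsToSend));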

@@ +2566,5 @@
> +        NS_WARN_IF(contentChild->SendAccumulateChildHistogram(accumulationsToSend));
> +    }
> +    if (keyedAccumulationsToSend.Length()) {
> +      mozilla::Unused <<
> +        NS_WARN_IF(contentChild->SendAccumulateChildKeyedHistogram(keyedAccumulationsToSend));

Ditto.
Fun fact: ./mach eslint on Windows doesn't identify the error for me.
Attachment #8782943 - Attachment is obsolete: true
Attachment #8783997 - Flags: review+
Blocks: 1298887
Pushed by georg.fritzsche@googlemail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/74b9b3f4ce25
Accumulate child histograms in the parent process r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/2310da8e4942
Pre-allocate child keyed histograms in the parent process r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/cd8049475224
Remove requestChildPayloads r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/6e4a7610406a
Move aggregated child telemetry to processes.content r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/2d375de64477
Support remote accumulation via JS histograms. r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/59b0fb3fa328
Update test_ChildHistograms for child telemetry aggregation r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/0397f79f86ad
Update test browser_use_counters.js to use aggregated child telemetry r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/283cee5fb8a7
Assert we're the parent process when clearing histograms. r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/7f8e3d336e2f
Ensure remaining batched telemetry is flushed on content process shutdown r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/aee17ede1c0e
Ensure Telemetry IDs coming from IPC are valid r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/80b0680f916f
Ensure we can record before trying to record categorical histograms. r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/3c5d66ad69a0
Aggregate child categorical histograms in the parent process. r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/ce72d25d0c5b
Test child categorical histogram aggregation r=gfritzsche
Backed out for build bustage:

    2c7ef95c2785 SH Backed out changeset 74b9b3f4ce25 (bug 1218576) for build bustage. r=backout on a CLOSED TREE
    d9deb379d830 SH Backed out changeset 7537a79e91a4 (bug 1218577)
    b64e58e453e1 SH Backed out changeset 2310da8e4942 (bug 1218576)
    c249998a085f SH Backed out changeset cd8049475224 (bug 1218576)
    6a4b7a63125d SH Backed out changeset 6e4a7610406a (bug 1218576)
    f327e2486a4b SH Backed out changeset 2d375de64477 (bug 1218576)
    ac368b44eb34 SH Backed out changeset 59b0fb3fa328 (bug 1218576)
    1092ce552e86 SH Backed out changeset 0397f79f86ad (bug 1218576)
    70b93415fbef SH Backed out changeset 283cee5fb8a7 (bug 1218576)
    7e6a9af6246b SH Backed out changeset 7f8e3d336e2f (bug 1218576)
    20775bc718db SH Backed out changeset aee17ede1c0e (bug 1218576)
    9470c7a31cf2 SH Backed out changeset 80b0680f916f (bug 1218576)
    452bab4b12e9 SH Backed out changeset 3c5d66ad69a0 (bug 1218576)
    53f5c4b914c4 SH Backed out changeset ce72d25d0c5b (bug 1218576)
    188b32f8df91 SH Backed out changeset 3c606cacdd1d (bug 1277504)
    15ad29361e48 SH Backed out changeset fe56ce42df42 (bug 1290027)

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=fe56ce42df420209cc940b2e88b676495c933d00
Flags: needinfo?(chutten)
Just fixing the bug number in the commit message.
Attachment #8783655 - Attachment is obsolete: true
Attachment #8786401 - Flags: review+
The offending lower-case 'u' has been replaced with an imposter of our choosing.

linux try build in progress here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c7b25a048e7a
Flags: needinfo?(chutten)
Attachment #8786403 - Flags: review?(gfritzsche)
Comment on attachment 8786403 [details] [diff] [review]
0015-bug-1218576-Change-unused.h-to-Unused.h-which-matter.patch

Review of attachment 8786403 [details] [diff] [review]:
-----------------------------------------------------------------

rs=me
Attachment #8786403 - Flags: review?(gfritzsche) → review+
Pushed by georg.fritzsche@googlemail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/7d75e044ec95
Accumulate child histograms in the parent process r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/a7907f9f7bab
Support subsession hgrams for child aggregation r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/ee46537edfbc
Pre-allocate child keyed histograms in the parent process r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/5bb29adfd2f3
Remove requestChildPayloads r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/c56206f87ddf
Move aggregated child telemetry to processes.content r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/80df18726a87
Support remote accumulation via JS histograms. r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/0fa32203e30a
Update test_ChildHistograms for child telemetry aggregation r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/c716caa64cb4
Update test browser_use_counters.js to use aggregated child telemetry r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/5ecce735b1bf
Assert we're the parent process when clearing histograms. r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/355d9c21953c
Ensure remaining batched telemetry is flushed on content process shutdown r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/ccbf45fbf4e8
Ensure Telemetry IDs coming from IPC are valid r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/25905e31cba5
Ensure we can record before trying to record categorical histograms. r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/009721cbd071
Aggregate child categorical histograms in the parent process. r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/548b9c783ce5
Test child categorical histogram aggregation r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/9b23af2bc178
Change unused.h to Unused.h, which matters to some filesystems. r=gfritzsche
Backout by archaeopteryx@coole-files.de:
https://hg.mozilla.org/integration/mozilla-inbound/rev/cc16519932db
Backed out changeset 9b23af2bc178 
https://hg.mozilla.org/integration/mozilla-inbound/rev/24dbbaf9d439
Backed out changeset 548b9c783ce5 
https://hg.mozilla.org/integration/mozilla-inbound/rev/e151edbf5df4
Backed out changeset 009721cbd071 
https://hg.mozilla.org/integration/mozilla-inbound/rev/45d412c18b1d
Backed out changeset 25905e31cba5 
https://hg.mozilla.org/integration/mozilla-inbound/rev/254d5efcd779
Backed out changeset ccbf45fbf4e8 
https://hg.mozilla.org/integration/mozilla-inbound/rev/525719d23c63
Backed out changeset 355d9c21953c 
https://hg.mozilla.org/integration/mozilla-inbound/rev/c42fa089230b
Backed out changeset 5ecce735b1bf 
https://hg.mozilla.org/integration/mozilla-inbound/rev/f14b41c536bd
Backed out changeset c716caa64cb4 
https://hg.mozilla.org/integration/mozilla-inbound/rev/b946234ac570
Backed out changeset 0fa32203e30a 
https://hg.mozilla.org/integration/mozilla-inbound/rev/fa6b6a5aa63f
Backed out changeset 80df18726a87 
https://hg.mozilla.org/integration/mozilla-inbound/rev/87518e260a1c
Backed out changeset c56206f87ddf 
https://hg.mozilla.org/integration/mozilla-inbound/rev/bd3510eccff2
Backed out changeset 5bb29adfd2f3 
https://hg.mozilla.org/integration/mozilla-inbound/rev/8c694a30bcfe
Backed out changeset ee46537edfbc 
https://hg.mozilla.org/integration/mozilla-inbound/rev/ddf6133b120d
Backed out changeset a7907f9f7bab 
https://hg.mozilla.org/integration/mozilla-inbound/rev/ab9c1d2e64b0
Backed out changeset 7d75e044ec95
Backed out for XPCshell crashes:

https://hg.mozilla.org/integration/mozilla-inbound/rev/ab9c1d2e64b065da15f5808355d17a010884ba8a
https://hg.mozilla.org/integration/mozilla-inbound/rev/ddf6133b120d6e1820eec2a99755bac92cd34807
https://hg.mozilla.org/integration/mozilla-inbound/rev/8c694a30bcfeac5efe3ea123a3c15298ae53dbbf
https://hg.mozilla.org/integration/mozilla-inbound/rev/bd3510eccff233e5094518c16cc5526336ff005f
https://hg.mozilla.org/integration/mozilla-inbound/rev/87518e260a1c840a34d2167cd76aa5a1090de470
https://hg.mozilla.org/integration/mozilla-inbound/rev/fa6b6a5aa63f372ffa6fde893f1d5b1ae6a888ce
https://hg.mozilla.org/integration/mozilla-inbound/rev/b946234ac570a4d2871a262990316cd69dc307cc
https://hg.mozilla.org/integration/mozilla-inbound/rev/f14b41c536bd01a16068088c4d31237d2d4cc832
https://hg.mozilla.org/integration/mozilla-inbound/rev/c42fa089230be7ddedb073cf7e1fe660a175edce
https://hg.mozilla.org/integration/mozilla-inbound/rev/525719d23c639418db640cc197723076ae469eb4
https://hg.mozilla.org/integration/mozilla-inbound/rev/254d5efcd779fce5eda7176bc1f97596a94d10c6
https://hg.mozilla.org/integration/mozilla-inbound/rev/45d412c18b1da9d05245bfa33210a99d8378e09b
https://hg.mozilla.org/integration/mozilla-inbound/rev/e151edbf5df4d7ee549a708bebdccfffca5672ff
https://hg.mozilla.org/integration/mozilla-inbound/rev/24dbbaf9d439402423422063362becb64948f481
https://hg.mozilla.org/integration/mozilla-inbound/rev/cc16519932dbaa8d0e6f415542523bceba9e4f7e

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=9b23af2bc178d79b7bdac35f56add8e557c6c937
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=34968886&repo=mozilla-inbound

17:37:27  WARNING -  PROCESS-CRASH | netwerk/test/unit_ipc/test_httpsuspend_wrap.js | application crashed [@ mozilla::ipc::MessageChannel::AssertWorkerThread]
17:37:27     INFO -  Crash dump filename: /tmp/xpc-other-9fR1KY/21f3f5ba-8b1a-bbc8-51ba515a-44b8de5e.dmp
17:37:27     INFO -  Operating system: Linux
17:37:27     INFO -                    0.0.0 Linux 3.13.0-79-generic #123-Ubuntu SMP Fri Feb 19 14:27:58 UTC 2016 x86_64
17:37:27     INFO -  CPU: amd64
17:37:27     INFO -       family 6 model 62 stepping 4
17:37:27     INFO -       1 CPU
17:37:27     INFO -  Crash reason:  SIGSEGV
17:37:27     INFO -  Crash address: 0x0
17:37:27     INFO -  Thread 10 (crashed)
17:37:27     INFO -   0  libxul.so!mozilla::ipc::MessageChannel::AssertWorkerThread [MessageChannel.h:9b23af2bc178 : 445 + 0x0]
17:37:27     INFO -      rbx = 0x0000000000000001   r12 = 0x00007f13d0740900
17:37:27     INFO -      r13 = 0x0000000000000001   r14 = 0x00007f13d51fe4f8
17:37:27     INFO -      r15 = 0x00007f13d51fe5b8   rip = 0x00007f13e93f90ae
17:37:27     INFO -      rsp = 0x00007f13d51fe480   rbp = 0x00007f13d51fe490
17:37:27     INFO -      Found by: given as instruction pointer in context
17:37:27     INFO -   1  libxul.so!mozilla::ipc::MessageChannel::CxxStackFrame::CxxStackFrame [MessageChannel.cpp:9b23af2bc178 : 230 + 0x7]
17:37:27     INFO -      rbx = 0x00007f13d51fe4f0   r12 = 0x00007f13d0740900
17:37:27     INFO -      r13 = 0x0000000000000001   r14 = 0x00007f13d51fe4f8
17:37:27     INFO -      r15 = 0x00007f13d51fe5b8   rip = 0x00007f13e94040e3
17:37:27     INFO -      rsp = 0x00007f13d51fe4a0   rbp = 0x00007f13d51fe4e0
17:37:27     INFO -      Found by: call frame info
17:37:27     INFO -   2  libxul.so!mozilla::ipc::MessageChannel::Send [MessageChannel.cpp:9b23af2bc178 : 772 + 0x7]
17:37:27     INFO -      rbx = 0x00007f13df375088   r12 = 0x00007f13d0740900
17:37:27     INFO -      r13 = 0x00007f13d51fe4f0   r14 = 0x00007f13d51fe4f8
17:37:27     INFO -      r15 = 0x00007f13d51fe5b8   rip = 0x00007f13e940babe
17:37:27     INFO -      rsp = 0x00007f13d51fe4f0   rbp = 0x00007f13d51fe530
17:37:27     INFO -      Found by: call frame info
17:37:27     INFO -   3  libxul.so!mozilla::dom::PContentChild::SendAccumulateChildHistogram [PContentChild.cpp:9b23af2bc178 : 6008 + 0xb]
17:37:27     INFO -      rbx = 0x00007f13df375020   r12 = 0x00007f13d51fe558
17:37:27     INFO -      r13 = 0x00007f13d0740900   r14 = 0x00007f13d51fe560
17:37:27     INFO -      r15 = 0x00007f13d51fe5b8   rip = 0x00007f13e96656d9
17:37:27     INFO -      rsp = 0x00007f13d51fe540   rbp = 0x00007f13d51fe5a0
17:37:27     INFO -      Found by: call frame info
17:37:27     INFO -   4  libxul.so!TelemetryHistogram::IPCTimerFired [TelemetryHistogram.cpp:9b23af2bc178 : 2651 + 0xa]
17:37:27     INFO -      rbx = 0x00007f13d51fe5b8   r12 = 0x00007f13d51fe5c0
17:37:27     INFO -      r13 = 0x00007f13df375020   r14 = 0x00007f13d51fe5c8
17:37:27     INFO -      r15 = 0x00007f13d51fe640   rip = 0x00007f13eb4b5b24
17:37:27     INFO -      rsp = 0x00007f13d51fe5b0   rbp = 0x00007f13d51fe600
17:37:27     INFO -      Found by: call frame info
17:37:27     INFO -   5  libxul.so!nsTimerImpl::Fire [nsTimerImpl.cpp:9b23af2bc178 : 521 + 0x9]
17:37:27     INFO -      rbx = 0x00007f13d0842580   r12 = 0x00007f13eb4b5a59
17:37:27     INFO -      r13 = 0x00007f13d51fe602   r14 = 0x00007f13d08425c0
17:37:27     INFO -      r15 = 0x00007f13d51fe640   rip = 0x00007f13e906f493
17:37:27     INFO -      rsp = 0x00007f13d51fe610   rbp = 0x00007f13d51fe680
17:37:27     INFO -      Found by: call frame info
17:37:27     INFO -   6  libxul.so!nsTimerEvent::Run [TimerThread.cpp:9b23af2bc178 : 286 + 0xf]
17:37:27     INFO -      rbx = 0x00007f13d0746020   r12 = 0x00007f13d0746048
17:37:27     INFO -      r13 = 0x00007f13d51fe6f0   r14 = 0x00007f13d51fe6e8
17:37:27     INFO -      r15 = 0x00007f13d51fe77f   rip = 0x00007f13e906d66f
17:37:27     INFO -      rsp = 0x00007f13d51fe690   rbp = 0x00007f13d51fe6c0
17:37:27     INFO -      Found by: call frame info
17:37:27     INFO -   7  libxul.so!nsThread::ProcessNextEvent [nsThread.cpp:9b23af2bc178 : 1058 + 0x10]
17:37:27     INFO -      rbx = 0x00007f13df3f9400   r12 = 0x00007f13d51fe6f8
17:37:27     INFO -      r13 = 0x00007f13d51fe6f0   r14 = 0x00007f13d51fe6e8
17:37:27     INFO -      r15 = 0x00007f13d51fe77f   rip = 0x00007f13e9068f58
17:37:28     INFO -      rsp = 0x00007f13d51fe6d0   rbp = 0x00007f13d51fe760
17:37:28     INFO -      Found by: call frame info
17:37:28     INFO -   8  libxul.so!NS_ProcessNextEvent [nsThreadUtils.cpp:9b23af2bc178 : 290 + 0xc]
17:37:28     INFO -      rbx = 0x0000000000000001   r12 = 0x00007f13d51fe820
17:37:28     INFO -      r13 = 0x00007f13d51fe818   r14 = 0x00007f13df3c09e4
17:37:28     INFO -      r15 = 0x000007db25d6e831   rip = 0x00007f13e908f2be
17:37:28     INFO -      rsp = 0x00007f13d51fe770   rbp = 0x00007f13d51fe790
17:37:28     INFO -      Found by: call frame info
17:37:28     INFO -   9  libxul.so!mozilla::net::nsSocketTransportService::Run [nsSocketTransportService2.cpp:9b23af2bc178 : 908 + 0x10]
17:37:28     INFO -      rbx = 0x00007f13df3c0900   r12 = 0x00007f13d51fe820
17:37:28     INFO -      r13 = 0x00007f13d51fe818   r14 = 0x00007f13df3c09e4
17:37:28     INFO -      r15 = 0x000007db25d6e831   rip = 0x00007f13e91299e5
17:37:28     INFO -      rsp = 0x00007f13d51fe7a0   rbp = 0x00007f13d51fe860
17:37:28     INFO -      Found by: call frame info
17:37:28     INFO -  10  libxul.so!nsThread::ProcessNextEvent [nsThread.cpp:9b23af2bc178 : 1058 + 0x10]
17:37:28     INFO -      rbx = 0x00007f13df3f9400   r12 = 0x00007f13d51fe898
17:37:28     INFO -      r13 = 0x00007f13d51fe890   r14 = 0x00007f13d51fe888
17:37:28     INFO -      r15 = 0x00007f13d51fe91f   rip = 0x00007f13e9068f58
17:37:28     INFO -      rsp = 0x00007f13d51fe870   rbp = 0x00007f13d51fe900
17:37:28     INFO -      Found by: call frame info
17:37:28     INFO -  11  libxul.so!NS_ProcessNextEvent [nsThreadUtils.cpp:9b23af2bc178 : 290 + 0xc]
17:37:28     INFO -      rbx = 0x0000000000000000   r12 = 0x00007f13dae036e0
17:37:28     INFO -      r13 = 0x00007f13d928f7b0   r14 = 0x00007f13df3f9400
17:37:28     INFO -      r15 = 0x00007f13df3f9400   rip = 0x00007f13e908f2be
17:37:28     INFO -      rsp = 0x00007f13d51fe910   rbp = 0x00007f13d51fe930
17:37:28     INFO -      Found by: call frame info
17:37:28     INFO -  12  libxul.so!mozilla::ipc::MessagePumpForNonMainThreads::Run [MessagePump.cpp:9b23af2bc178 : 338 + 0x9]
17:37:28     INFO -      rbx = 0x00007f13d928f780   r12 = 0x00007f13dae036e0
17:37:28     INFO -      r13 = 0x00007f13d928f7b0   r14 = 0x00007f13df3f9400
17:37:28     INFO -      r15 = 0x00007f13df3f9400   rip = 0x00007f13e9400d42
17:37:28     INFO -      rsp = 0x00007f13d51fe940   rbp = 0x00007f13d51fe9a0
17:37:28     INFO -      Found by: call frame info
17:37:28     INFO -  13  libxul.so!MessageLoop::RunInternal [message_loop.cc:9b23af2bc178 : 232 + 0x16]
17:37:28     INFO -      rbx = 0x00007f13dae036e0   r12 = 0x00007f13d51fea40
17:37:28     INFO -      r13 = 0x00007f13d51fea50   r14 = 0x00007f13dae036e0
17:37:28     INFO -      r15 = 0x00007f13df3f9420   rip = 0x00007f13e93dd2e1
17:37:28     INFO -      rsp = 0x00007f13d51fe9b0   rbp = 0x00007f13d51fe9e0
17:37:28     INFO -      Found by: call frame info
17:37:28     INFO -  14  libxul.so!MessageLoop::Run [message_loop.cc:9b23af2bc178 : 225 + 0x7]
17:37:28     INFO -      rbx = 0x00007f13dae036e0   r12 = 0x00007f13d51fea40
17:37:28     INFO -      r13 = 0x00007f13d51fea50   r14 = 0x00007f13dae036e0
17:37:28     INFO -      r15 = 0x00007f13df3f9420   rip = 0x00007f13e93dd308
17:37:28     INFO -      rsp = 0x00007f13d51fe9f0   rbp = 0x00007f13d51fea20
17:37:28     INFO -      Found by: call frame info
17:37:28     INFO -  15  libxul.so!nsThread::ThreadFunc [nsThread.cpp:9b23af2bc178 : 459 + 0x7]
17:37:28     INFO -      rbx = 0x00007f13df3f9400   r12 = 0x00007f13d51fea40
17:37:28     INFO -      r13 = 0x00007f13d51fea50   r14 = 0x00007f13dae036e0
17:37:28     INFO -      r15 = 0x00007f13df3f9420   rip = 0x00007f13e9069b6c
17:37:28     INFO -      rsp = 0x00007f13d51fea30   rbp = 0x00007f13d51fea90
17:37:28     INFO -      Found by: call frame info
17:37:28     INFO -  16  libnspr4.so!_pt_root [ptthread.c:9b23af2bc178 : 216 + 0x6]
17:37:28     INFO -      rbx = 0x00007f13df35cfe0   r12 = 0x0000000000000000
17:37:28     INFO -      r13 = 0x0000000000002788   r14 = 0x00007f13d51ff700
17:37:28     INFO -      r15 = 0x00007f13d51ff520   rip = 0x00007f13ee5bc6bf
17:37:28     INFO -      rsp = 0x00007f13d51feaa0   rbp = 0x00007f13d51feae0
17:37:28     INFO -      Found by: call frame info
17:37:28     INFO -  17  libpthread-2.15.so + 0x7e99
17:37:28     INFO -      rbx = 0x0000000000000000   r12 = 0x00007ffe17dbbc58
17:37:28     INFO -      r13 = 0x00007f13d51ff9c0   r14 = 0x0000000000000000
17:37:28     INFO -      r15 = 0x0000000000000003   rip = 0x00007f13ee7d8e9a
17:37:28     INFO -      rsp = 0x00007f13d51feaf0   rbp = 0x0000000000000000
17:37:28     INFO -      Found by: call frame info
Flags: needinfo?(chutten)
This does not reproduce on Windows, and did not happen on the linux32 run over here [1]... I'm in the process of getting a one-click loaner to hopefully get some sort of information about how this is exploding...

...did the Assert[2] fail, or did the assert contain a null deref? Because, if it's the former, it's wrong. The timer fired, which means it's using the TimerThread, which is just fine (happens all the time). I suppose mWorkerLoopID could be wrong, but why?

From this printout, I'm assuming it's the latter. MessageLoop::current()->id() includes a null deref, which means MessageLoop::current() is returning nullptr, which means the chromium IPC code has no current MessageLoop... which might mean the process is in the middle of shutting down? I don't know why else it'd be 0.

This is all terribly confusing.

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=092cebba0ebc
[2]: https://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.h?q=AssertWorkerThread&redirect_type=direct#443
Flags: needinfo?(chutten)
Apparently it's an assertion failure, and it reproduces on try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f3d32e7f36e8

I'm bringing up a linux64 dev environment in a VM. Let's see if I can reproduce this locally.
Depends on: 1299312
:froydnj and :jld on IRC twigged me to how this could be a threading issue because nsTimer isn't threadsafe. It assumes all operations are happening on the thread that created it (or its target)[1], and IPC using PContent requires that we use the main thread.

My theory is now as follows: for the failing tests, there is a non-main thread racing with the main thread to be the first to record child telemetry. (There is no requirement that accumulations happen on the main thread; Telemetry is threadsafe(ish).) If the non-main thread Accumulate()s first, then the IPC timer is registered on the wrong thread, asserts fail, cats and dogs are living together, and the world ends. Sorta. Sometimes.

So, we clearly must only interact with gIPCTimer on the main thread, regardless of whether it is the cause of all of this nonsense. While I'm thinking of the proper Software Engineering Solution (where the locks should sit and whatnot), I've coded up a little hack that MOZ_ASSERTs in IPCTimerFired that we're NS_IsMainThread(), and refuses to arm gIPCTimer unless we're NS_IsMainThread(). This is up for try here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b88564939f5e
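
Concretely, the hack amounts to (sketch):

// In IPCTimerFired:
MOZ_ASSERT(NS_IsMainThread());

// When arming the timer:
if (!NS_IsMainThread()) {
  // Refuse to arm off the main thread; accumulations stay batched until a
  // main-thread accumulation arms the timer.
  return;
}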

[1]: https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/nsITimer.idl#68
A simple change to ensure the timer is only messed with on the main thread.
Attachment #8787320 - Flags: review?(nfroyd)
Attachment #8787320 - Flags: feedback?(gfritzsche)
Comment on attachment 8787320 [details] [diff] [review]
0016-bug-1218576-Ensure-IPCTimer-is-on-the-main-thread.-r.patch

Review of attachment 8787320 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +2640,4 @@
>  void
>  TelemetryHistogram::IPCTimerFired(nsITimer* aTimer, void* aClosure)
>  {
> +  MOZ_ASSERT(NS_IsMainThread());

So with this, we can now drop the mutex below.
Attachment #8787320 - Flags: feedback?(gfritzsche) → feedback+
(In reply to Georg Fritzsche [:gfritzsche] from comment #149)
> Comment on attachment 8787320 [details] [diff] [review]
> 0016-bug-1218576-Ensure-IPCTimer-is-on-the-main-thread.-r.patch
> 
> Review of attachment 8787320 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: toolkit/components/telemetry/TelemetryHistogram.cpp
> @@ +2640,4 @@
> >  void
> >  TelemetryHistogram::IPCTimerFired(nsITimer* aTimer, void* aClosure)
> >  {
> > +  MOZ_ASSERT(NS_IsMainThread());
> 
> So with this, we can now drop the mutex below.

Not unless nsTArrays are threadsafe. g*ccumulations can be (and are) accessed on other threads; the mutex is what protects those accesses. See internal_RemoteAccumulate.
(In reply to Chris H-C :chutten from comment #150)
> Not unless nsTArrays are threadsafe. g*ccumulations can be (and are)
> accessed on other threads; the mutex is what protects those accesses. See
> internal_RemoteAccumulate.

You are right of course. This was my weekend brain being short-sighted.
Comment on attachment 8787320 [details] [diff] [review]
0016-bug-1218576-Ensure-IPCTimer-is-on-the-main-thread.-r.patch

Review of attachment 8787320 [details] [diff] [review]:
-----------------------------------------------------------------

Works for me.  I'd like to see a patch with Georg's feedback applied, and main thread asserts liberally sprinkled around.
Attachment #8787320 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd [:froydnj] from comment #152)
> Works for me.  I'd like to see a patch with Georg's feedback applied, and
> main thread asserts liberally sprinkled around.

Oh, I realized one thing that bothers me about this patch; are we going to send a bunch of events to the main thread all to arm the timer, and then only have one of them actually do anything?  That seems wasteful.  We should have something like "did we start arming the timer", right?
Flags: needinfo?(chutten)
I did have `if (gIPCTimerArmed) { return; }` at the top of internal_armIPCTimer() in an unsubmitted version of this patch. That ought to take care of it? I imagine the main thread operates at a higher priority than other threads, so it ought not to be interrupted during internal_armIPCTimerMainThread often enough to warrant a different flag.
Flags: needinfo?(chutten) → needinfo?(nfroyd)
(In reply to Chris H-C :chutten from comment #154)
> I did have `if (gIPCTimerArmed) { return; }` at the top of
> internal_armIPCTimer() in an unsubmitted version of this patch. That ought
> to take care of it? I imagine the main thread operates at a higher priority
> than other threads so it ought not to be interrupted during
> internal_armIPCTimerMainThread often enough to warrant a different flag.

Even if you can assume the main thread operates at a higher priority, you don't want it to starve out other threads, and on a multicore machine, you're going to have other threads running simultaneously, anyway, right?  So the following sequence is completely plausible:

MainThread: Doing some long-running operation (e.g. content JS), or has a very full queue.
ThreadA: Requests timer to be armed.  Event is stuffed in the main thread's queue.
ThreadB: Requests timer to be armed.  Event is stuffed in the main thread's queue.
[...repeat ad nauseam...]
MainThread: discovers many "arm timer" events, only to really do anything with the first one.

You cannot assume anything about the ordering of events between threads unless you're taking steps to enforce said ordering.

Also, gIPCTimerArmed needs to be an Atomic<bool, Relaxed>; apologies for missing this the first time around.
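
Something like this (sketch; the "arming in flight" flag name is assumed,
and Relaxed ordering suffices since the flag only guards against redundant
dispatches, not data accesses):

static mozilla::Atomic<bool, mozilla::Relaxed> gIPCTimerArming(false);

void internal_armIPCTimer()
{
  if (gIPCTimerArmed || gIPCTimerArming.exchange(true)) {
    return; // timer already armed, or an arm request is already in flight
  }
  // ... dispatch to the main thread, which arms the timer and clears
  // gIPCTimerArming once gIPCTimerArmed is set ...
}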
Flags: needinfo?(nfroyd)
This ought to sort it.
Attachment #8787320 - Attachment is obsolete: true
Attachment #8787737 - Flags: review?(nfroyd)
Attachment #8787737 - Flags: feedback+
Comment on attachment 8787737 [details] [diff] [review]
0016-bug-1218576-Ensure-IPCTimer-is-on-the-main-thread.-r.patch

Review of attachment 8787737 [details] [diff] [review]:
-----------------------------------------------------------------

Still racy, and in a bad way. :(

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +1296,5 @@
>  
> +void internal_armIPCTimer()
> +{
> +  if (NS_IsMainThread()) {
> +    internal_armIPCTimerMainThread();

Is it deliberate that we're calling the arming function here without the lock, and calling the arming function in the non-main-thread case with the lock?  Doing things this way opens the possibility for an intermittent leak, like so:

MainThread: calls internal_armIPCTimerMainThread()
MainThread: sees gIPCTimerArmed is false, proceeds
MainThread: interrupted after checking gIPCTimer, but before the CallCreateInstance call
OtherThread: calls internal_armIPCTimerMainThread()
OtherThread: returns from internal_armIPCTimerMainThread()
MainThread: creates a *second* timer, leaking the previous one.

There are also potentially weird scenarios like a thread trying to initialize the timer after it's already been initialized by some other thread:

MainThread: calls internal_armIPCTimerMainThread()
MainThread: sees gIPCTimerArmed is false, proceeds
MainThread: interrupted after CallCreateInstance, but before checking gIPCTimer for success.
OtherThread: calls internal_armIPCTimerMainThread()
OtherThread: returns from internal_armIPCTimerMainThread()
MainThread: checks gIPCTimer, sees it's non-null (but *not* the one the main thread created earlier), initializes it

which I think is bad news.
Attachment #8787737 - Flags: review?(nfroyd)
internal_* methods are only called when holding the lock (excepting rare JS-related use-cases discussed at length in the de-racing Telemetry bug 1258183), so the main thread case already held the lock. What this does is ensure the non-main-thread case _also_ holds the lock.

This is actually significantly safer (may even be completely safe) than the previous one, though still *expletive* hard to read since the lock acquisition happens in the TelemetryHistogram::* functions instead of the internal_* ones.

:gfritzsche, would it follow the module design if I made a full TelemetryHistogram::* function for acquiring the lock and arming the timer? Then the call is

if (NS_IsMainThread()) {
  internal_armIPCTimerMainThread();
} else {
  NS_DispatchToMainThread(NS_NewRunnableFunction(&TelemetryHistogram::armIPCTimerMainThread));
}

With TelemetryHistogram::armIPCTimerMainThread simply being

StaticMutexAutoLock locker(gTelemetryHistogramMutex);
internal_armIPCTimerMainThread();
Flags: needinfo?(nfroyd)
Flags: needinfo?(gfritzsche)
(In reply to Chris H-C :chutten from comment #158)
> internal_* methods are only called when holding the lock (excepting rare
> JS-related use-cases discussed at length in the de-racing Telemetry bug
> 1258183), so the main thread case already held the lock. What this does is
> ensure the non-main-thread case _also_ holds the lock.
> 
> This is actually significantly safer (may even be completely safe) than the
> previous one, though still *expletive* hard to read since the lock
> acquisition happens in the TelemetryHistogram::* functions instead of the
> internal_* ones.

Ah, OK, my mistake.  (This is why I prefer passing |const {Static,}MutexAutoLock&| parameters around, so one doesn't have to remember any sort of naming convention to determine whether or not the lock is held...)

But we still might have a storm of these events being dispatched to the main thread, right?

Thread1: Dispatch runnable function.
Thread1: Dispatch runnable function.
Thread2: Dispatch runnable function.
...
Main thread: Finally get around to processing all the runnable functions, only one of which actually does anything.
Flags: needinfo?(nfroyd)
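The idiom froydnj mentions looks roughly like this (illustrative names, not from the patch): functions that require the lock take a proof-of-lock parameter, so holding the mutex is enforced by the compiler rather than by the internal_* naming convention.

static StaticMutex gTelemetryHistogramMutex;

// Impossible to call without naming a live StaticMutexAutoLock.
void internal_armIPCTimer(const StaticMutexAutoLock& aProofOfLock);

namespace TelemetryHistogram {
void ArmIPCTimer()
{
  StaticMutexAutoLock locker(gTelemetryHistogramMutex);
  internal_armIPCTimer(locker);
}
} // namespace TelemetryHistogram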
Ack! I forgot to put in the if (gIPCTimerArmed) { return; } at the top.
I apologize for the churn on this patch. It is getting rather ridiculous.
Attachment #8787737 - Attachment is obsolete: true
Flags: needinfo?(gfritzsche)
Attachment #8788461 - Flags: review?(nfroyd)
Comment on attachment 8788461 [details] [diff] [review]
0016-bug-1218576-Ensure-IPCTimer-is-on-the-main-thread.-r.patch

Review of attachment 8788461 [details] [diff] [review]:
-----------------------------------------------------------------

No problem!  If I had caught these timer issues the first time around, we could have avoided a lot of this unpleasantness. :(

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +1282,3 @@
>  {
> +  MOZ_ASSERT(NS_IsMainThread());
> +  gIPCTimerArming = false;

It seems a little weird to modify this here (since the arming process has not yet completed), but since all accesses to this are done under the lock, I think it's OK.  Moving it would also make things more complicated, I think.
Attachment #8788461 - Flags: review?(nfroyd) → review+
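Putting the review notes together, the main-thread half ends up shaped roughly like this (a sketch using this bug's names; IPCTimerFired and the interval constant are assumptions):

void internal_armIPCTimerMainThread()
{
  MOZ_ASSERT(NS_IsMainThread());
  gIPCTimerArming = false;  // dispatch phase is over; see review note above
  if (gIPCTimerArmed) {
    return;  // an earlier event already armed the timer
  }
  if (!gIPCTimer) {
    CallCreateInstance(NS_TIMER_CONTRACTID, &gIPCTimer);
  }
  if (gIPCTimer) {
    gIPCTimer->InitWithFuncCallback(IPCTimerFired, nullptr,
                                    kIPCTimerIntervalMs,  // hypothetical constant
                                    nsITimer::TYPE_ONE_SHOT);
    gIPCTimerArmed = true;
  }
}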
Pushed by chutten@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/fc16cda7781b
Accumulate child histograms in the parent process r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/7fdd6b6ab594
Pre-allocate child keyed histograms in the parent process r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/baf105cbe0c8
Remove requestChildPayloads r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/eb5dbe28ab20
Move aggregated child telemetry to processes.content r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/df28918fe236
Support remote accumulation via JS histograms. r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/1678482b2fad
Update test_ChildHistograms for child telemetry aggregation r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/c68fc5ad5ecf
Update test browser_use_counters.js to use aggregated child telemetry r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/3caacb5c213b
Assert we're the parent process when clearing histograms. r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/ce2a2dabb042
Ensure remaining batched telemetry is flushed on content process shutdown r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/1b6666eb3b81
Ensure Telemetry IDs coming from IPC are valid r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/848f4ef30978
Ensure we can record before trying to record categorical histograms. r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/ea7282078b05
Aggregate child categorical histograms in the parent process. r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/9d15ae92f2fa
Test child categorical histogram aggregation r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/57f9849f0f8f
Change unused.h to Unused.h, which matters to some filesystems. r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/62009556e4ad
Ensure IPCTimer is on the main thread. r=froydnj f=gfritzsche
See patch 8 for how the browser_use_counters test was modified to accommodate these changes.

The test is timing-dependent: bug 1224873 tracks the intermittent failures.

Basically, we need to wait extra-long for slow Linux builds.

:philor also pointed out weird things like https://treeherder.mozilla.org/logviewer.html#?job_id=35459411&repo=mozilla-inbound#L17763 that trace back to my short tenure on inbound. The isValid assert should only fail if messages are getting corrupted in transit. Unlike the threading issue which caused this earlier, there's no evidence that anything else is going awry beforehand, during the test. However, there's a crash afterwards, so that's no good.

Other weird things are:

OOMs appending to the accumulations array (wait, what? Why? How?): https://treeherder.mozilla.org/logviewer.html#?job_id=35459785&repo=mozilla-inbound#L13606

A different webrtc failure (hang) which may have to do with a dropped message (SendAccumulateChildHistogram failed): https://treeherder.mozilla.org/logviewer.html#?job_id=35453310&repo=mozilla-inbound#L17109

A SIGSEGV in AccumulateChild: https://treeherder.mozilla.org/logviewer.html#?job_id=35459411&repo=mozilla-inbound#L17763

Probably the same crash (bad address instead of SIGSEGV), which :Aryx noted above: https://treeherder.mozilla.org/logviewer.html#?job_id=35464223&repo=mozilla-inbound#L13106

And an Android SIGABRT: https://treeherder.mozilla.org/logviewer.html#?job_id=35465646&repo=mozilla-inbound#L1924

Actions:
 * Maybe flushing the Telemetry in the use counters test will take care of enough of the "slow ASAN being slow" problem to allow it to pass. Otherwise, I may need to extend the waitForCondition timeout.
 * Figure out the OOM. Is it possible for the accumulations array to get large enough to OOM? We depend on a timer to "drain" the array on the main thread, so maybe it's there? Why wouldn't this happen on other platforms?
 * The SIGSEGV/bad address in AccumulateChild: It's happening in or around the isValid assert. What is going on there? I need to think on this more and come up with cogent questions and a list of people to ask them of.
 * Determine if any of these are reasonably reproducible on try, so I have a verification step to test things on. I notice none of these are failures on Windows, on which I'm developing, so that requires that I have a decent build+debug+validate cycle or else all is lost.



So much for "third time's the charm"
Flags: needinfo?(chutten)
(In reply to Chris H-C :chutten from comment #166)
> [...]
> 
> Actions:
>  * Maybe flushing the Telemetry in the use counters test will take care of
> enough of the "slow ASAN being slow" problem to allow it to pass. Otherwise,
> I may need to extend the waitForCondition timeout.
>  * Figure out the OOM. Is it possible for the accumulations array to get
> large enough to OOM? We depend on a timer to "drain" the array on the main
> thread, so maybe it's there? Why wouldn't this happen on other platforms?

How about adding a MOZ_ASSERT() that fires when the accumulations array passes a maximum size, to make this clearer?

>  * The SIGSEGV/bad address in AccumulateChild: It's happening in or around
> the isValid assert. What is going on there? I need to think on this more and
> come up with cogent questions and a list of people to ask them of.

Looking at this from here:
- https://treeherder.mozilla.org/logviewer.html#?job_id=35459411&repo=mozilla-inbound#L17763
- Assertion failure: isValid, at /home/worker/workspace/build/src/toolkit/components/telemetry/TelemetryHistogram.cpp:2177
- https://hg.mozilla.org/integration/mozilla-inbound/file/5db21f7f4efa2eb6d8e59beef7f97f64001b2b6f/toolkit/components/telemetry/TelemetryHistogram.cpp#l2177
... let's do the isValid check in the child too, when initially recording the accumulation?
That way we can actually get a stack to blame callers (or we find out that we transmitted corrupted data?).

If this is other code accumulating out-of-bounds IDs, we might have to:
- allow this for now
- log an error
- try to enforce proper asserts on this in a follow-up bug
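Concretely, the proposed check is something like the following sketch (signature assumed; the review further down moves it up into internal_Accumulate() so local and remote accumulations share it):

void internal_Accumulate(mozilla::Telemetry::ID aId, uint32_t aSample)
{
  // Validate in the originating process, so a failed assert yields a stack
  // that blames the caller rather than the parent-side receiver.
  bool isValid = internal_IsHistogramEnumId(aId);
  MOZ_ASSERT(isValid, "Telemetry accumulation using invalid id");
  if (!isValid) {
    return;  // release builds drop the sample instead of crashing later
  }
  // ... existing local accumulation / internal_RemoteAccumulate() path ...
}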
1) Putting a size constraint on the number of Accumulation structs allowed in the accumulations arrays takes care of the OOMs and the isValid asserts. It seems as though, when extreme demands are made of it, the IPC layer can bork in interesting ways.

I'm guessing 5000 is a decent enough maximum size for now. I can file a follow-up for tuning that number if you'd like (or for recording how often it is exceeded in the wild).

2) browser_use_counters.js on ASan is failing because of a default configuration of canRecordExtended:false. I'm not sure why that's different on that platform than others, but I do know why it causes a problem only _with_ my changes:

Without my changes, the recording of a histogram is governed by canRecord{Base,Extended} on the process on which the Accumulate call is run. With my changes, it is governed by that setting on both the process on which Accumulate is run _and_ on the parent process which stores it.

So, how to fix?
1) Change the test so it sets and restores canRecordExtended on both processes?
2) Change the code to not care what the parent's canRecordX setting is if the accumulation came from the child?

I'm leaning towards 1), since allowing canRecordExtended to be meaningfully set to different values on different processes sounds like a feature request.
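For point 1, the size constraint amounts to something like this sketch (the immediate-send helper is hypothetical; 5000 is the number proposed above):

static const size_t kAccumulationsHighWaterMark = 5000;

// Inside the child's remote-accumulation path, under the lock:
gAccumulations->AppendElement(Accumulation{aId, aSample});
if (gAccumulations->Length() > kAccumulationsHighWaterMark) {
  internal_SendBatchNow();  // hypothetical: flush now, don't wait for the timer
} else {
  internal_armIPCTimer();
}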
(In reply to Chris H-C :chutten from comment #168)
> 2) browser_use_counters.js on ASan is failing because of a default
> configuration of canRecordExtended:false. I'm not sure why that's different
> on that platform than others, but I do know why it causes a problem only
> _with_ my changes:
> 
> Without my changes, the recording of a histogram is governed by
> canRecord{Base,Extended} on the process on which the Accumulate call is run.
> With my changes, it is governed by that setting on both the process on which
> Accumulate is run _and_ on the parent process which stores it.
> 
> So, how to fix?
> 1) Change the test so it sets and restores canRecordExtended on both
> processes?
> 2) Change the code to not care what the parent's canRecordX setting is if
> the accumulation came from the child?
> 
> I'm leaning towards 1), since allowing canRecordExtended to be meaningfully
> set to different values on different processes sounds like a feature request.

We should do 1) and fix the test - this is a scenario that we should never experience in Firefox.

Re 2), I can see canRecord* being out of sync temporarily between parent & content process, so we should still check the recording state on both sides (in the child for "efficiency", in the parent for "correctness").
I filed bug 1303239 on improving the propagation of canRecord*, but even after that we could still e.g. have in-flight accumulations coming from the child.
Attachment #8791972 - Flags: review?(gfritzsche)
Comment on attachment 8791977 [details] [diff] [review]
0019-bug-1218576-Turn-canRecordExtended-on-for-browser_us.patch

Review of attachment 8791977 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/base/test/browser_use_counters.js
@@ +80,4 @@
>  });
>  
>  add_task(function* () {
> +  CC["@mozilla.org/base/telemetry;1"].getService(Ci.nsITelemetry).canRecordExtended = gOldParentCanRecord;

let telemetry = ...
telemetry.canRecordExtended = ...
Attachment #8791977 - Flags: review?(gfritzsche) → review+
Comment on attachment 8791972 [details] [diff] [review]
0017-bug-1218576-Ensure-child-accumulations-are-for-valid.patch

Review of attachment 8791972 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +1313,5 @@
>      return false;
>    }
> +  bool isValid = internal_IsHistogramEnumId(aId);
> +  MOZ_ASSERT(isValid, "Child telemetry accumulation using invalid id");
> +  if (!isValid) {

I think we should move this up into `internal_Accumulate()`, to also cover non-remote accumulations with the same check.

@@ +1334,5 @@
>    }
> +  bool isValid = internal_IsHistogramEnumId(aId);
> +  MOZ_ASSERT(isValid, "Child keyed telemetry accumulation using invalid id");
> +  if (!isValid) {
> +    return true;

Ditto.
Attachment #8791972 - Flags: review?(gfritzsche)
Attachment #8791974 - Flags: review?(gfritzsche) → review+
Attachment #8791972 - Attachment is obsolete: true
Attachment #8792556 - Flags: review?(gfritzsche)
Comment on attachment 8792556 [details] [diff] [review]
0017-bug-1218576-Ensure-accumulations-are-for-valid-IDs.patch

Review of attachment 8792556 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +1344,1 @@
>        internal_RemoteAccumulate(aHistogram, aSample)) {

Nit: No need to put `!isValid` on a new line.
Attachment #8792556 - Flags: review?(gfritzsche) → review+
Pushed by chutten@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9b792ae11734
Accumulate child histograms in the parent process r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/832793e20c54
Support subsession hgrams for child aggregation r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/1b2b8973c8a2
Pre-allocate child keyed histograms in the parent process r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/712681439306
Remove requestChildPayloads r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/f5a08bea2e17
Move aggregated child telemetry to processes.content r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/206ce3f7b0cd
Support remote accumulation via JS histograms. r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/4b674e3249ec
Update test_ChildHistograms for child telemetry aggregation r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/021273dad9ae
Update test browser_use_counters.js to use aggregated child telemetry r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/92ed900459df
Assert we're the parent process when clearing histograms. r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/83d24a3dc1a9
Ensure remaining batched telemetry is flushed on content process shutdown r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/02bee0c4865a
Ensure Telemetry IDs coming from IPC are valid r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/6e2045544766
Ensure we can record before trying to record categorical histograms. r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/c97a5b2b086a
Aggregate child categorical histograms in the parent process. r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/5d3383a8df5f
Test child categorical histogram aggregation r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/f3b858440d01
Change unused.h to Unused.h, which matters to some filesystems. r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/fcfa94c81e0a
Ensure IPCTimer is on the main thread. r=froydnj f=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/1b498933da3a
Ensure accumulations are for valid IDs. r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/ca430b715311
Request an immediate batch send when the batch size hits a high water mark. r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/5a20e1ffb8ea
Turn canRecordExtended on for browser_use_counters.js r=gfritzsche
Depends on: 1304009
Depends on: 1304660
Depends on: 1304730
No longer depends on: 1299312
Blocks: 1304730
No longer depends on: 1304730
Depends on: 1307031
Depends on: 1282196
Depends on: 1307696
Depends on: 1304269
Depends on: 1306884
Depends on: 1306856
Chris, is this something that manual QA could help test? Adding the feature keyword for tracking.
Flags: qe-verify?
Flags: needinfo?(chutten)
Keywords: feature
For a small manual test we can direct testers to look at their about:telemetry and ensure that the Raw JSON view is showing child process telemetry under processes.content instead of childPayloads... but aside from that, this is mostly internal stuff. If I've done it correctly, you shouldn't notice much at all :)
Flags: needinfo?(chutten)
(In reply to Chris H-C :chutten from comment #180)
> For a small manual test we can direct testers to look at their
> about:telemetry and ensure that the Raw JSON view is showing child process
> telemetry under processes.content instead of childPayloads... but aside from
> that, this is mostly internal stuff. If I've done it correctly, you
> shouldn't notice much at all :)

Thank you for following up on this, Chris! I think we can treat this as regular bug work instead and just make sure that the test you described passes each time, as this bug rides the trains.
Flags: qe-verify? → qe-verify+
Depends on: 1312778
I can confirm that the latest Nightly (build ID 20161025030205) no longer shows child process telemetry under childPayloads, but under processes.content.
Status: RESOLVED → VERIFIED
Depends on: 1356898
Whiteboard: [measurement:client][fce-active] → [measurement:client][fce-active-legacy]