Closed Bug 1313326 Opened 3 years ago Closed 3 years ago

Support recording events in content processes

Categories

(Toolkit :: Telemetry, defect, P1)

defect
Points:
3

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox55 --- fixed

People

(Reporter: gfritzsche, Assigned: gfritzsche)

References

(Blocks 1 open bug)

Details

(Whiteboard: [measurement:client])

Attachments

(6 files, 14 obsolete files)

35.17 KB, patch
gfritzsche
: review+
Details | Diff | Splinter Review
3.79 KB, patch
Dexter
: review+
Details | Diff | Splinter Review
3.21 KB, patch
chutten
: review+
Details | Diff | Splinter Review
4.55 KB, patch
Dexter
: review+
Details | Diff | Splinter Review
12.02 KB, patch
gfritzsche
: review+
Details | Diff | Splinter Review
76.06 KB, patch
gfritzsche
: review+
Details | Diff | Splinter Review
After bug 1302663 adds support for event recording in the parent process, we need to add support for recording them in content processes too.
Priority: P3 → P1
Assignee: nobody → gfritzsche
Points: --- → 3
Comment on attachment 8834067 [details] [diff] [review]
Part 1 - Refactor Telemetry IPC code

Review of attachment 8834067 [details] [diff] [review]:
-----------------------------------------------------------------

This tries to refactor the Telemetry IPC code a bit from "each process implements Telemetry messages" to "one interface implements the Telemetry messages for all childs".

Gabor, is that something you could look over or redirect to someone?
I'm interested specifically in feedback on how the IPC communication & actors are hooked up in:
- {PTelemetry,PContent,PGPU}.ipdl
- {Telemetry,Content,GPU}(Parent,Child}.{h,cpp}
- TelemetryIPCAccumulator.cpp

The idea is to share one child instance of the PTelemetry protocol in each child process.
Childs call into TelemetryChild, TelemetryParents turns those calls into actual Telemetry accumulations in the parent process.

I'm not sure if i got the actor lifetimes handled correctly right now.
TelemetryIPCAccumulator::DeInitializeGlobalState() is presumably called rather late, does the child actor deletion/teardown have to happen at some specific point here?

A blocking problem i'm running into is how to manage PTelemetry from PGPU.
The GPU process handling has the parent/child methodology backwards AFAICT.
So, the way the (commented out) code in PGPU/GPUChild/GPUParent add managing PTelemetry i get errors like:
> class GPUParent final : public PGPUParent
>       ^
> [...]/mc1/obj/ipc/ipdl/_ipdlheaders/mozilla/gfx/PGPUParent.h:213:5: note: unimplemented pure virtual method 'AllocPTelemetryParent' in 'GPUParent'
>     AllocPTelemetryParent(const uint32_t& aProcessType) = 0;

So it looks like the IPDL generator expects me to construct TelemetryParent in GPUParent, but GPUParent is in the child process.
Changing to this model doesn't match what i do with ContentParent/ContentChild.
I wonder how i can get PTelemetry exposed via PGPU without duplicating any code here.
Attachment #8834067 - Flags: feedback?(gkrizsanits)
Comment on attachment 8834067 [details] [diff] [review]
Part 1 - Refactor Telemetry IPC code

Review of attachment 8834067 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Georg Fritzsche [:gfritzsche] from comment #2)
> Comment on attachment 8834067 [details] [diff] [review]
> Part 1 - Refactor Telemetry IPC code
> 
> Review of attachment 8834067 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This tries to refactor the Telemetry IPC code a bit from "each process
> implements Telemetry messages" to "one interface implements the Telemetry
> messages for all childs".
> 
> Gabor, is that something you could look over or redirect to someone?
> I'm interested specifically in feedback on how the IPC communication &
> actors are hooked up in:
> - {PTelemetry,PContent,PGPU}.ipdl
> - {Telemetry,Content,GPU}(Parent,Child}.{h,cpp}
> - TelemetryIPCAccumulator.cpp
> 
> The idea is to share one child instance of the PTelemetry protocol in each
> child process.
> Childs call into TelemetryChild, TelemetryParents turns those calls into
> actual Telemetry accumulations in the parent process.
> 
> I'm not sure if i got the actor lifetimes handled correctly right now.
> TelemetryIPCAccumulator::DeInitializeGlobalState() is presumably called
> rather late, does the child actor deletion/teardown have to happen at some
> specific point here?

I'll look into this part later let's get the PGPU part sorted out first.

> 
> A blocking problem i'm running into is how to manage PTelemetry from PGPU.
> The GPU process handling has the parent/child methodology backwards AFAICT.
> So, the way the (commented out) code in PGPU/GPUChild/GPUParent add managing
> PTelemetry i get errors like:
> > class GPUParent final : public PGPUParent
> >       ^
> > [...]/mc1/obj/ipc/ipdl/_ipdlheaders/mozilla/gfx/PGPUParent.h:213:5: note: unimplemented pure virtual method 'AllocPTelemetryParent' in 'GPUParent'
> >     AllocPTelemetryParent(const uint32_t& aProcessType) = 0;
> 

Yeah, I was not aware of this either, which makes it questionable that I'm the right person to review this. Explanation for why are we doing this: bug 1271180 comment 8, which I think makes a lot of sense and probably you should ask for review from dvander.

> So it looks like the IPDL generator expects me to construct TelemetryParent
> in GPUParent, but GPUParent is in the child process.
> Changing to this model doesn't match what i do with
> ContentParent/ContentChild.
> I wonder how i can get PTelemetry exposed via PGPU without duplicating any
> code here.

In the same bug I mentioned above the last patch for PCrashReporter seems to be quite useful to get the IPC skeleton you need right. https://bugzilla.mozilla.org/attachment.cgi?id=8759898&action=diff

It is using manages PCrashReporter (parent creates client); I think that should fix your problem and the patch gives you an idea how to implement these bits manually.
Attachment #8834067 - Flags: feedback?(gkrizsanits)
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #3)
> > A blocking problem i'm running into is how to manage PTelemetry from PGPU.
> > The GPU process handling has the parent/child methodology backwards AFAICT.
> > So, the way the (commented out) code in PGPU/GPUChild/GPUParent add managing
> > PTelemetry i get errors like:
> > > class GPUParent final : public PGPUParent
> > >       ^
> > > [...]/mc1/obj/ipc/ipdl/_ipdlheaders/mozilla/gfx/PGPUParent.h:213:5: note: unimplemented pure virtual method 'AllocPTelemetryParent' in 'GPUParent'
> > >     AllocPTelemetryParent(const uint32_t& aProcessType) = 0;
> > 
> 
> Yeah, I was not aware of this either, which makes it questionable that I'm
> the right person to review this. Explanation for why are we doing this: bug
> 1271180 comment 8, which I think makes a lot of sense and probably you
> should ask for review from dvander.
> 
> > So it looks like the IPDL generator expects me to construct TelemetryParent
> > in GPUParent, but GPUParent is in the child process.
> > Changing to this model doesn't match what i do with
> > ContentParent/ContentChild.
> > I wonder how i can get PTelemetry exposed via PGPU without duplicating any
> > code here.
> 
> In the same bug I mentioned above the last patch for PCrashReporter seems to
> be quite useful to get the IPC skeleton you need right.
> https://bugzilla.mozilla.org/attachment.cgi?id=8759898&action=diff
> 
> It is using manages PCrashReporter (parent creates client); I think that
> should fix your problem and the patch gives you an idea how to implement
> these bits manually.

Ok, it seems that the `(parent creates client)` syntax in bug 1277439 was dropped and PGPU doesn't manage PCrashReporter (which apparently switched to just shmem communication [1]).

dvander, do you have a suggestion for how i can hook this up now with the GPU process?

1: https://dxr.mozilla.org/mozilla-central/rev/af8a2573d0f1e9cc6f2ba0ab67d7a702a197f177/ipc/glue/CrashReporterHost.h#23
Flags: needinfo?(dvander)
This IPDL restriction is unfortunately really annoying. We tried a new IPDL model in bug 1277439 but it was deemed too risky. As you noted for the past two cases we just removed the actors: PCrashReporter was bypassed and PMemoryRequest was deleted.

You could do that here as well, drop PTelemetry.ipdl and instead manually create child/parent (or client/host) interfaces that the top-level process actors have to implement. The messages would have to be duplicated across PGPU.ipdl and PContent.ipdl, which is a little annoying.

Another option would be some template trickery. Since your protocol only has async messages, make it an async protocol instead of sync. Then use "both:" for the direction. Both the Child and Parent interfaces will then have the same methods. Then you could compose the class with templates. For example:

> // <T> must be a PTelemetryParent or PTelemetryChild.
> template <typename T>
> class TelemetryHost : public T
> {
> public:
>    // Parent-process methods.
> };
>
> // <T> must be a PTelemetryParent or PTelemetryChild.
> template <typename T>
> class TelemetryClient : public T
> {
> public:
>   // Sub-process methods.
> };

Your PContent alloc methods would look like:

> PTelemetryChild* ContentChild::AllocPTelemetryChild() {
>   return new TelemetryClient<PTelemetryChild>();
> };
>
> PTelemetryParent* ContentParent::AllocPTelemetryParent() {
>   return new TelemetryHost<PTelemetryParent>();
> };

But the PGPU versions would be flipped, so the Child returns a Host composed of a PTelemetryChild.

> PTelemetryChild* GPUChild::AllocPTelemetryChild() {
>   return new TelemetryHost<PTelemetryChild>();
> };
>
> PTelemetryParent* GPUParent::AllocPTelemetryParent() {
>   return new TelemetryClient<PTelemetryParent>();
> };
Flags: needinfo?(dvander)
I'm skipping consolidating the functionality into a single interface for now and limit this to a bit of refactoring and cleanup. We can still get to the single Telemetry interface later (e.g. by using a common mixin/stub).
Attachment #8834067 - Attachment is obsolete: true
Chris, can you review this? It mostly re-organizes the Telemetry IPC code to be somewhat more contained.
Attachment #8836777 - Flags: review?(chutten)
Attachment #8835492 - Attachment is obsolete: true
Alessio, can you look over this? Note that i'm following up with patches for: restricting recording to specific processes, fixing up the child event timestamps, updating the docs.
Attachment #8836778 - Flags: review?(alessio.placitelli)
Comment on attachment 8836777 [details] [diff] [review]
Part 1 - Refactor Telemetry IPC code

Review of attachment 8836777 [details] [diff] [review]:
-----------------------------------------------------------------

I do like that ipc things are now in ipc/ so that it's clear what's going on. Putting the *Child functions in there is also a nice way to protect against users just deciding to use Telemetry::AccumulateChild without knowing what it is.

Seems legit.
Attachment #8836777 - Flags: review?(chutten) → review+
Comment on attachment 8836778 [details] [diff] [review]
Part 2 - Support recording events in child processes

Review of attachment 8836778 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good, Georg! The only two things I'm really concerned about are the InfallibleTArray and the event clearing issues. So it's basically r+ once these points are cleared.

::: dom/ipc/ContentParent.cpp
@@ +4888,5 @@
>    return IPC_OK();
>  }
>  
> +mozilla::ipc::IPCResult
> +ContentParent::RecvRecordChildEvents(nsTArray<mozilla::Telemetry::ChildEventData>&& aEvents)

Why are you using fallible nsTArray instead of InfallibleTArray (here and in GPUChild.*)?

::: toolkit/components/telemetry/TelemetryEvent.cpp
@@ +18,5 @@
>  #include "jsapi.h"
>  #include "nsJSUtils.h"
>  #include "nsXULAppAPI.h"
>  #include "nsUTF8Utils.h"
> +#include "nsXULAppAPI.h"

It looks like you already have this include two lines above. Did you mean to remove the other one?

@@ +763,5 @@
>    if (!gInitDone) {
>      return;
>    }
>  
> +  EventRecordArray* eventRecords = GetEventRecordsForProcess(lock, GeckoProcessType_Default);

Why isn't this just gEventRecords.Clear()?

::: toolkit/components/telemetry/ipc/TelemetryComms.h
@@ +303,5 @@
>  };
>  
> +template<>
> +struct
> +ParamTraits<mozilla::Telemetry::ChildEventData>

The serializers look good to me. However, it looks like we might need another review for it. See the end of the page at https://developer.mozilla.org/en-US/docs/Mozilla/IPDL/Type_Serialization

::: toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp
@@ +183,5 @@
> +  if (!gChildEvents) {
> +    gChildEvents = new nsTArray<ChildEventData>();
> +  }
> +  // Store the action.
> +  gChildEvents->AppendElement(ChildEventData{timestamp, nsCString(category),

Are you waiting for the conversation to settle in bug 1338555 before considering adding a limit to event accumulation here?

::: toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.h
@@ +6,5 @@
>  #ifndef TelemetryIPCAccumulator_h__
>  #define TelemetryIPCAccumulator_h__
>  
>  #include "mozilla/AlreadyAddRefed.h"
> +#include "mozilla/Tuple.h"

Where is Tuple used?

::: toolkit/components/telemetry/tests/unit/xpcshell.ini
@@ +62,5 @@
>  [test_TelemetryTimestamps.js]
>  skip-if = toolkit == 'android'
>  [test_TelemetryCaptureStack.js]
>  [test_TelemetryEvents.js]
> +[test_ChildEvents.js]

You would probably need to disable this on Android builds. See bug 1331366.
Attachment #8836778 - Flags: review?(alessio.placitelli) → feedback+
(In reply to Alessio Placitelli [:Dexter] from comment #10)
> > +mozilla::ipc::IPCResult
> > +ContentParent::RecvRecordChildEvents(nsTArray<mozilla::Telemetry::ChildEventData>&& aEvents)
> 
> Why are you using fallible nsTArray instead of InfallibleTArray (here and in
> GPUChild.*)?

Its the same type:
https://dxr.mozilla.org/mozilla-central/rev/0eef1d5a39366059677c6d7944cfe8a97265a011/xpcom/ds/nsTArrayForwardDeclare.h#34

> > +  gChildEvents->AppendElement(ChildEventData{timestamp, nsCString(category),
> 
> Are you waiting for the conversation to settle in bug 1338555 before
> considering adding a limit to event accumulation here?

I'll take this to a separate patch with details pending that bug.
Addressed feedback aside from what is mentioned in comment 11.
Attachment #8837133 - Flags: review?(alessio.placitelli)
Attachment #8836778 - Attachment is obsolete: true
Attachment #8837133 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8837133 [details] [diff] [review]
Part 2 - Support recording events in child processes

Review of attachment 8837133 [details] [diff] [review]:
-----------------------------------------------------------------

Nathan, can you specifically review in TelemetryComms.h the new IPC (de)serialization of `ChildEventData`?
Context: I am adding support for recording in content processes for event telemetry [1].

1: https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/collection/events.html
Attachment #8837133 - Flags: review?(nfroyd)
This restricts event recording to the specified processes. Some Python & C++ code was refactored for sharing between measurement types.
Attachment #8837326 - Flags: review?(alessio.placitelli)
Comment on attachment 8837326 [details] [diff] [review]
Part 3 - Restrict event recording to specified processes

Review of attachment 8837326 [details] [diff] [review]:
-----------------------------------------------------------------

Nice refactoring!

::: toolkit/components/telemetry/gen-event-data.py
@@ +79,5 @@
>                   extras[0], # extra keys index
>                   extras[1], # extra keys count
>                   e.expiry_day,
> +                 e.dataset,
> +                 " | ".join(e.record_in_processes_enum),),

nit: there's a trailing comma before the parenthesis -> ),)
Attachment #8837326 - Flags: review?(alessio.placitelli) → review+
Depends on: 1339749
Comment on attachment 8836777 [details] [diff] [review]
Part 1 - Refactor Telemetry IPC code

Moved this to bug 1339749.
Attachment #8836777 - Attachment is obsolete: true
Comment on attachment 8837133 [details] [diff] [review]
Part 2 - Support recording events in child processes

Review of attachment 8837133 [details] [diff] [review]:
-----------------------------------------------------------------

WFM!

::: toolkit/components/telemetry/TelemetryEvent.cpp
@@ +397,5 @@
> +      }
> +    }
> +
> +    // Add the optional string value only when needed.
> +    // When extra is empty and this has no value, we can save a little space.

Did you mean "...and this record has no value"?
Attachment #8837133 - Flags: review?(nfroyd) → review+
Comment on attachment 8837647 [details] [diff] [review]
Part 1 - Support recording events in child processes

Locally broken patches.
Attachment #8837647 - Attachment is obsolete: true
Attachment #8837326 - Attachment is obsolete: true
Attachment #8837133 - Attachment is obsolete: true
Attachment #8837657 - Flags: review+
Attachment #8837658 - Flags: review+
Comment on attachment 8837659 [details] [diff] [review]
Part 3 - Update about:telemetry to show child process events

Review of attachment 8837659 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good with the comment addressed.

::: toolkit/content/aboutTelemetry.js
@@ +1749,2 @@
>      const hasData = events && Object.keys(events).length > 0;
>      setHasData("events-section", hasData);

This should be: hasData || processesSelect.options.length
Attachment #8837659 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8837660 [details] [diff] [review]
Part 4 - Flush the event IPC buffer when hitting a limit

Review of attachment 8837660 [details] [diff] [review]:
-----------------------------------------------------------------

Seems legit.
Attachment #8837660 - Flags: review?(chutten) → review+
Comment on attachment 8838135 [details] [diff] [review]
Part 5 - Use correct child event timestamps

With the previous patches, we have events recorded in different processes, with the relative timestamp (ms since process start) for their process.
This means we can't put parent and child events on the same timeline for analysis.
With this patch, we track the offset between parent and child times and use it to adjust the child process events timestamp.

@Chris, does this look overall sensible to you?
Do the Telemetry parts look correct?

@Nathan, do you think it makes sense to track the time deltas between processes this way in ContentParent/GPUChild?
I took the approach of just sending the "ms since child process start" from the child process, send it to the parent and store the delta to "ms since parent process start", then use that to adjust the event times.
But i could imagine scenarios (busy/spammed event loops) where the sending of the MsSinceProcessStart message might take long.
I can also think of just sending the childs TimeStamp::ProcessCreation() time, but i'm worried that those might actually be off compared to the clock the parent sees.
Attachment #8838135 - Flags: review?(nfroyd)
Attachment #8838135 - Flags: review?(chutten)
Comment on attachment 8838135 [details] [diff] [review]
Part 5 - Use correct child event timestamps

Review of attachment 8838135 [details] [diff] [review]:
-----------------------------------------------------------------

This is my first time looking at event telemetry code, so I'm not sure I'd catch everything (for instance, if you missed something). However, it looks like a very straightforward plumbing operation of getting the time deltas where you want them.

Questions: Why aren't you using a "thicker" type like TimeDuration? Is "synchronizing process clocks" really a game we want to get into? Has this not already been solved?
Attachment #8838135 - Flags: review?(chutten) → feedback+
Attachment #8838181 - Attachment is obsolete: true
(In reply to Chris H-C :chutten from comment #28)
> Questions: Why aren't you using a "thicker" type like TimeDuration? Is
> "synchronizing process clocks" really a game we want to get into? Has this
> not already been solved?

A type like TimeDuration didn't seem necessary for the problem here. It might make sense though if we were to re-use this for other code.

I would love to avoid dealing with synchronizing clocks if we have any better options - i hope for Nathans input here.
If we don't have better/existing options, i still need to solve it to make child process events useful.
With child Telemetry being batched and sent every two seconds, we otherwise won't be able to properly look at event orders across processes.
(e.g. answering: which action/event in the parent processes preceded/caused this content process event?)
Comment on attachment 8838135 [details] [diff] [review]
Part 5 - Use correct child event timestamps

Review of attachment 8838135 [details] [diff] [review]:
-----------------------------------------------------------------

This looks reasonable to me, but I am also aware that time can be a tricky thing.  So I'm not totally ready to r+.

Do you really need to have all events ever collected on one global timeline?  I can understand the appeal of that for a profiling system (e.g. event tracing), but for telemetry, why would we care?
Attachment #8838135 - Flags: review?(nfroyd)
The motivation for the event data collection (Event Telemetry) is to follow user flow and correlate it with other events.
At the very least we'd need to have a global ordering of events for that, but times may be relevant (did something happen right after that click? or 10 seconds after?).
What happens is distributed over multiple processes now with e10s, but we shouldn't have engineers or analysts worry about those details.
It should be enough to just tell Telemetry to record an event client-side. Then in re:dash etc. analysts get all those events show up in order & with the relative times, without any additional work.

To get relative timestamps right, we can fix them up client- or server-side. Either way though, we need to collect the time offset between processes somewhere.
The only alternative i see to timestamp fix-up is to immediately send up all content events without batching - but that could substantially increase IPC traffic etc.

Does that put it better into context?
Flags: needinfo?(nfroyd)
(In reply to Georg Fritzsche [:gfritzsche] from comment #33)
> The motivation for the event data collection (Event Telemetry) is to follow
> user flow and correlate it with other events.

That sounds like a ton of overhead while running the browser, and a ton of data to sift through.

> Then in re:dash etc. analysts get all those events show up in order & with
> the relative times, without any additional work.

For my own edification, what is "re:dash, etc."?

> To get relative timestamps right, we can fix them up client- or server-side.
> Either way though, we need to collect the time offset between processes
> somewhere.
> The only alternative i see to timestamp fix-up is to immediately send up all
> content events without batching - but that could substantially increase IPC
> traffic etc.

One alternative would be to keep sent/received counters for IPC messages and tag all events with those in addition to the (process-local) timestamp.  That would permit you to do the correlation between parent and child events without resorting to calculating time differences...though it would increase the size of telemetry events significantly.  I guess if you weren't sending IPC messages for a span of time, and several things happened in the parent and child processes, you wouldn't be able to determine a total ordering between those events.

> Does that put it better into context?

That helps, yes.
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #34)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #33)
> > The motivation for the event data collection (Event Telemetry) is to follow
> > user flow and correlate it with other events.
> 
> That sounds like a ton of overhead while running the browser, and a ton of
> data to sift through.

We have instrumentation for about 1.4k probes at the moment [1].
I don't think we need to worry about adding some event instrumentation, as long as we are aware of the performance impact.
The potential data volume is high, yes. We are handling this in other bugs like bug 1329139 - right now event recording is off by default and experiments like TestPilot or SHIELD studies can activate this.

> 
> > Then in re:dash etc. analysts get all those events show up in order & with
> > the relative times, without any additional work.
> 
> For my own edification, what is "re:dash, etc."?

re:dash is currently our central/common tool for querying data, see:
https://sql.telemetry.mozilla.org/

> 
> > To get relative timestamps right, we can fix them up client- or server-side.
> > Either way though, we need to collect the time offset between processes
> > somewhere.
> > The only alternative i see to timestamp fix-up is to immediately send up all
> > content events without batching - but that could substantially increase IPC
> > traffic etc.
> 
> One alternative would be to keep sent/received counters for IPC messages and
> tag all events with those in addition to the (process-local) timestamp. 
> That would permit you to do the correlation between parent and child events
> without resorting to calculating time differences...though it would increase
> the size of telemetry events significantly.  I guess if you weren't sending
> IPC messages for a span of time, and several things happened in the parent
> and child processes, you wouldn't be able to determine a total ordering
> between those events.

Right, that sounds like we are just trading for different problems.
As long as there are no fundamental problems with syncing times, i think we should try to serve the data needs/requests here.
I intend to get QA on this across important configurations (platforms & e10s with 1 & 2 content processes), that should hopefully nail down issues.

Does that cover your concerns?
Flags: needinfo?(nfroyd)
Thinking a bit more, i think we should just record with TimeStamp::Now()/NowLoRes() in the content process.
Both parent & content process will access the same system clock for this.
Then in the parent we can convert this into relative timestamps, based on TimeStamp::ProcessCreation().
(In reply to Georg Fritzsche [:gfritzsche] from comment #35)
> We have instrumentation for about 1.4k probes at the moment [1].

Where [1] is: http://georgf.github.io/fx-data-explorer/stats.html
As mentioned above, syncing timestamps across processes is actually not needed all as processes use the same system clock.
Attachment #8841505 - Flags: review?(alessio.placitelli)
Attachment #8838135 - Attachment is obsolete: true
Attachment #8841506 - Flags: review?(alessio.placitelli)
Attachment #8838182 - Attachment is obsolete: true
Comment on attachment 8841506 [details] [diff] [review]
Part 6 - Update event documentation

Review of attachment 8841506 [details] [diff] [review]:
-----------------------------------------------------------------

The current part looks good! Maybe we should mention that the 'record_in_processes' is now mandatory and the possible values (see https://dxr.mozilla.org/mozilla-central/rev/106a96755d3bcebe64bbbc3b521d65d262ba9c02/toolkit/components/telemetry/docs/collection/scalars.rst#102).

Does it make sense to add a version history at the end of the file as done with the scalars as well?
Attachment #8841506 - Flags: review?(alessio.placitelli) → feedback+
Comment on attachment 8841505 [details] [diff] [review]
Part 5 - Use correct child event timestamps

Review of attachment 8841505 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryEvent.cpp
@@ +584,5 @@
>    StaticMutexAutoLock locker(gTelemetryEventsMutex);
>    for (uint32_t i = 0; i < aEvents.Length(); ++i) {
>      const mozilla::Telemetry::ChildEventData e = aEvents[i];
> +    bool inconsistent = false;
> +    double relativeTimestamp = (e.timestamp - TimeStamp::ProcessCreation(inconsistent)).ToMilliseconds();

nit: could you add a brief comment here (and point to the docs too, if you plan to expand the idea there) about the process? Why we are subtracting, etc.?
Attachment #8841505 - Flags: review?(alessio.placitelli) → review+
Attachment #8841943 - Flags: review?(alessio.placitelli)
Attachment #8841506 - Attachment is obsolete: true
Attachment #8841943 - Flags: review?(alessio.placitelli) → review+
Attachment #8841505 - Attachment is obsolete: true
Attachment #8842009 - Flags: review+
Fixed browser_UsageTelemetry*.js tests that use event snapshots.
Attachment #8837657 - Attachment is obsolete: true
Attachment #8842415 - Flags: review+
Keywords: checkin-needed
We settled the open timestamp question per comment 36 ff.
If there are other concerns about impact on performance, storage, etc. i'm happy to discuss them.
I don't think that is specific & blocking to this bug though, feel free to file new bugs against bug 1286606 or bug 1302662.
Flags: needinfo?(nfroyd)
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a6db4cae9990
Part 1: Support recording events in child processes. r=dexter, r=froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/223a0be48369
Part 2: Restrict event recording to specified processes. r=dexter
https://hg.mozilla.org/integration/mozilla-inbound/rev/fe98d700bb64
Part 3: Update about:telemetry to show child process events. r=dexter
https://hg.mozilla.org/integration/mozilla-inbound/rev/bc61502be42e
Part 4: Flush the event IPC buffer when hitting a limit. r=chutten
https://hg.mozilla.org/integration/mozilla-inbound/rev/49a7a91b73cc
Part 5: Use correct child event timestamps. r=dexter, f=chutten
https://hg.mozilla.org/integration/mozilla-inbound/rev/e2395b3b1e16
Part 6: Update event documentation. r=dexter
Keywords: checkin-needed
Backed out for failing browser_UsageTelemetry_content.js on Linux x64 asan and debug (at least):

https://hg.mozilla.org/integration/mozilla-inbound/rev/5a561476eed268ed19b7364e15be437254ad89c8
https://bugzilla.mozilla.org/show_bug.cgi?id=1313326
https://hg.mozilla.org/integration/mozilla-inbound/rev/a54020e769ed87a70568aebaf01cbc6aa205b311
https://hg.mozilla.org/integration/mozilla-inbound/rev/a756569e4a3c12b7ce5b34ddf9437f7ac64f64bc
https://hg.mozilla.org/integration/mozilla-inbound/rev/6a3af7c0213856d5c8fde04682d99ed2b82db3c1
https://hg.mozilla.org/integration/mozilla-inbound/rev/c7b0d1ae6e34a99a2f088574865f8e952c50532d
https://hg.mozilla.org/integration/mozilla-inbound/rev/f8cfad56018526958b5a5d2b3f7c6c0f3ae10bfc

Push with failure: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=e2395b3b1e16f0631700f917bdac5633132be68d&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Failure log (scroll down): https://treeherder.mozilla.org/logviewer.html#?job_id=80914160&repo=mozilla-inbound

[task 2017-03-01T17:52:25.344818Z] 17:52:25     INFO - TEST-START | browser/modules/test/browser_UsageTelemetry_content.js
[task 2017-03-01T17:52:27.134832Z] 17:52:27     INFO - Not taking screenshot here: see the one that was previously logged
[task 2017-03-01T17:52:27.135191Z] 17:52:27     INFO - Buffered messages logged at 17:52:25
[task 2017-03-01T17:52:27.135403Z] 17:52:27     INFO - Entering test bound setup
[task 2017-03-01T17:52:27.135588Z] 17:52:27     INFO - Leaving test bound setup
[task 2017-03-01T17:52:27.135783Z] 17:52:27     INFO - Entering test bound test_context_menu
[task 2017-03-01T17:52:27.136030Z] 17:52:27     INFO - Select all the text in the page.
[task 2017-03-01T17:52:27.136212Z] 17:52:27     INFO - Open the context menu.
[task 2017-03-01T17:52:27.136403Z] 17:52:27     INFO - Buffered messages logged at 17:52:26
[task 2017-03-01T17:52:27.136571Z] 17:52:27     INFO - Click on search.
[task 2017-03-01T17:52:27.136770Z] 17:52:27     INFO - Buffered messages logged at 17:52:27
[task 2017-03-01T17:52:27.136952Z] 17:52:27     INFO - Validate the search metrics.
[task 2017-03-01T17:52:27.139019Z] 17:52:27     INFO - TEST-PASS | browser/modules/test/browser_UsageTelemetry_content.js | browser.engagement.navigation.contextmenu must be recorded. - true == true - 
[task 2017-03-01T17:52:27.140341Z] 17:52:27     INFO - TEST-PASS | browser/modules/test/browser_UsageTelemetry_content.js | browser.engagement.navigation.contextmenu must contain the 'search' key. - true == true - 
[task 2017-03-01T17:52:27.140704Z] 17:52:27     INFO - TEST-PASS | browser/modules/test/browser_UsageTelemetry_content.js | 1 - 1 == true - 
[task 2017-03-01T17:52:27.145886Z] 17:52:27     INFO - TEST-PASS | browser/modules/test/browser_UsageTelemetry_content.js | This search must only increment one entry in the scalar. - 1 == 1 - 
[task 2017-03-01T17:52:27.147845Z] 17:52:27     INFO - TEST-PASS | browser/modules/test/browser_UsageTelemetry_content.js | The histogram must contain other-MozSearch.contextmenu. - true == true - 
[task 2017-03-01T17:52:27.152068Z] 17:52:27     INFO - TEST-PASS | browser/modules/test/browser_UsageTelemetry_content.js | The key other-MozSearch.contextmenu must contain 1. - 1 == 1 - 
[task 2017-03-01T17:52:27.153828Z] 17:52:27     INFO - Buffered messages finished
[task 2017-03-01T17:52:27.155826Z] 17:52:27     INFO - TEST-UNEXPECTED-FAIL | browser/modules/test/browser_UsageTelemetry_content.js | We should have recorded events in the parent process. - false == true - JS frame :: chrome://mochitests/content/browser/browser/modules/test/browser_UsageTelemetry_content.js :: test_context_menu :: line 84
[task 2017-03-01T17:52:27.157210Z] 17:52:27     INFO - Stack trace:
[task 2017-03-01T17:52:27.158968Z] 17:52:27     INFO - chrome://mochitests/content/browser/browser/modules/test/browser_UsageTelemetry_content.js:test_context_menu:84
[task 2017-03-01T17:52:27.160567Z] 17:52:27     INFO - resource://gre/modules/Task.jsm:TaskImpl_run:319
[task 2017-03-01T17:52:27.170543Z] 17:52:27     INFO - *************************
[task 2017-03-01T17:52:27.172471Z] 17:52:27     INFO - A coding exception was thrown and uncaught in a Task.
[task 2017-03-01T17:52:27.174460Z] 17:52:27     INFO - Full message: TypeError: events.default is undefined
[task 2017-03-01T17:52:27.176497Z] 17:52:27     INFO - Full stack: test_context_menu@chrome://mochitests/content/browser/browser/modules/test/browser_UsageTelemetry_content.js:85:3
[task 2017-03-01T17:52:27.178435Z] 17:52:27     INFO - process@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:922:23
[task 2017-03-01T17:52:27.180486Z] 17:52:27     INFO - walkerLoop@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:806:7
[task 2017-03-01T17:52:27.182560Z] 17:52:27     INFO - Promise*scheduleWalkerLoop@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:739:11
[task 2017-03-01T17:52:27.190695Z] 17:52:27     INFO - schedulePromise@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:770:7
[task 2017-03-01T17:52:27.192718Z] 17:52:27     INFO - completePromise@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:707:7
[task 2017-03-01T17:52:27.194539Z] 17:52:27     INFO - _resolveAndCallOptionalCallback@chrome://mochikit/content/tests/SimpleTest/specialpowersAPI.js:991:5
[task 2017-03-01T17:52:27.196228Z] 17:52:27     INFO - setTimeout handler*SimpleTest_setTimeoutShim@chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:672:12
[task 2017-03-01T17:52:27.197921Z] 17:52:27     INFO - delayAgain@chrome://mochikit/content/tests/SimpleTest/specialpowersAPI.js:714:12
[task 2017-03-01T17:52:27.202806Z] 17:52:27     INFO - delayedCallback@chrome://mochikit/content/tests/SimpleTest/specialpowersAPI.js:719:19
[task 2017-03-01T17:52:27.205172Z] 17:52:27     INFO - setTimeout handler*SimpleTest_setTimeoutShim@chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:672:12
[task 2017-03-01T17:52:27.208927Z] 17:52:27     INFO - _setTimeout@chrome://mochikit/content/tests/SimpleTest/specialpowersAPI.js:700:7
[task 2017-03-01T17:52:27.210997Z] 17:52:27     INFO - prefObs@chrome://mochikit/content/tests/SimpleTest/specialpowersAPI.js:1162:7
[task 2017-03-01T17:52:27.212823Z] 17:52:27     INFO - _receiveMessageAPI@chrome://mochikit/content/tests/SimpleTest/SpecialPowersObserverAPI.js:322:22
[task 2017-03-01T17:52:27.214574Z] 17:52:27     INFO - ChromePowers.prototype._receiveMessage@chrome://mochikit/content/tests/SimpleTest/ChromePowers.js:83:14
[task 2017-03-01T17:52:27.218577Z] 17:52:27     INFO - ChromePowers.prototype._sendSyncMessage@chrome://mochikit/content/tests/SimpleTest/ChromePowers.js:40:11
[task 2017-03-01T17:52:27.220467Z] 17:52:27     INFO - _setPref@chrome://mochikit/content/tests/SimpleTest/specialpowersAPI.js:1328:12
[task 2017-03-01T17:52:27.222214Z] 17:52:27     INFO - _applyPrefs@chrome://mochikit/content/tests/SimpleTest/specialpowersAPI.js:1173:9
[task 2017-03-01T17:52:27.224058Z] 17:52:27     INFO - pushPrefEnv/<@chrome://mochikit/content/tests/SimpleTest/specialpowersAPI.js:1108:9
[task 2017-03-01T17:52:27.226006Z] 17:52:27     INFO - Promise@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:390:5
[task 2017-03-01T17:52:27.229720Z] 17:52:27     INFO - pushPrefEnv@chrome://mochikit/content/tests/SimpleTest/specialpowersAPI.js:1095:12
[task 2017-03-01T17:52:27.231563Z] 17:52:27     INFO - setup@chrome://mochitests/content/browser/browser/modules/test/browser_UsageTelemetry_content.js:26:9
[task 2017-03-01T17:52:27.233164Z] 17:52:27     INFO - TaskImpl_run@resource://gre/modules/Task.jsm:319:42
[task 2017-03-01T17:52:27.234797Z] 17:52:27     INFO - TaskImpl@resource://gre/modules/Task.jsm:277:3
[task 2017-03-01T17:52:27.236407Z] 17:52:27     INFO - asyncFunction@resource://gre/modules/Task.jsm:252:14
[task 2017-03-01T17:52:27.238497Z] 17:52:27     INFO - Task_spawn@resource://gre/modules/Task.jsm:166:12
[task 2017-03-01T17:52:27.243045Z] 17:52:27     INFO - TaskImpl_handleResultValue@resource://gre/modules/Task.jsm:389:16
[task 2017-03-01T17:52:27.244785Z] 17:52:27     INFO - TaskImpl_run@resource://gre/modules/Task.jsm:327:15
[task 2017-03-01T17:52:27.246596Z] 17:52:27     INFO - TaskImpl@resource://gre/modules/Task.jsm:277:3
[task 2017-03-01T17:52:27.248350Z] 17:52:27     INFO - asyncFunction@resource://gre/modules/Task.jsm:252:14
[task 2017-03-01T17:52:27.250017Z] 17:52:27     INFO - Task_spawn@resource://gre/modules/Task.jsm:166:12
[task 2017-03-01T17:52:27.251761Z] 17:52:27     INFO - Tester_execTest@chrome://mochikit/content/browser-test.js:735:9
[task 2017-03-01T17:52:27.253656Z] 17:52:27     INFO - Tester.prototype.nextTest</<@chrome://mochikit/content/browser-test.js:655:7
[task 2017-03-01T17:52:27.255693Z] 17:52:27     INFO - SimpleTest.waitForFocus/waitForFocusInner/focusedOrLoaded/<@chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:791:59
[task 2017-03-01T17:52:27.259297Z] 17:52:27     INFO - *************************
[task 2017-03-01T17:52:27.278906Z] 17:52:27     INFO - Not taking screenshot here: see the one that was previously logged
[task 2017-03-01T17:52:27.280814Z] 17:52:27     INFO - TEST-UNEXPECTED-FAIL | browser/modules/test/browser_UsageTelemetry_content.js | Uncaught exception - at chrome://mochitests/content/browser/browser/modules/test/browser_UsageTelemetry_content.js:85 - TypeError: events.default is undefined

More failures in https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=ec8ac10801d6c054a3e74f89fd1857c0bdac638b&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Flags: needinfo?(gfritzsche)
This might be asan-only in the end.
Attachment #8842415 - Attachment is obsolete: true
Flags: needinfo?(gfritzsche)
Attachment #8843349 - Flags: review+
(In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on intermittent or backout) from comment #49)
> This might be asan-only in the end.

It was: This is (again) the problem of differences with ASAN Telemetry build settings.

`checkEvents()` handles that, so i fixed part 1 to leave the event assertions in browser_UsageTelemetry*.js to that function.
Keywords: checkin-needed
unable to find 'browser/modules/test/browser_UsageTelemetry_content.js' for patching
(use '--prefix' to apply patch relative to the current directory)
2 out of 2 hunks FAILED -- saving rejects to file browser/modules/test/browser_UsageTelemetry_content.js.rej
unable to find 'browser/modules/test/browser_UsageTelemetry_content_aboutHome.js' for patching
(use '--prefix' to apply patch relative to the current directory)
1 out of 1 hunks FAILED -- saving rejects to file browser/modules/test/browser_UsageTelemetry_content_aboutHome.js.rej
unable to find 'browser/modules/test/browser_UsageTelemetry_searchbar.js' for patching
(use '--prefix' to apply patch relative to the current directory)
3 out of 3 hunks FAILED -- saving rejects to file browser/modules/test/browser_UsageTelemetry_searchbar.js.rej
unable to find 'browser/modules/test/browser_UsageTelemetry_urlbar.js' for patching
(use '--prefix' to apply patch relative to the current directory)
4 out of 4 hunks FAILED -- saving rejects to file browser/modules/test/browser_UsageTelemetry_urlbar.js.rej
patch failed, unable to continue (try -v)
Flags: needinfo?(gfritzsche)
Keywords: checkin-needed
Pushed by georg.fritzsche@googlemail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/89ad9e8b7532
Part 1 - Support recording events in child processes. r=dexter,froydnj
https://hg.mozilla.org/integration/mozilla-inbound/rev/0285cee759cf
Part 2 - Restrict event recording to specified processes. r=dexter
https://hg.mozilla.org/integration/mozilla-inbound/rev/1b6fa0e1f697
Part 3 - Update about:telemetry to show child process events. r=dexter
https://hg.mozilla.org/integration/mozilla-inbound/rev/96804319f925
Part 4 - Flush the event IPC buffer when hitting a limit. r=chutten
https://hg.mozilla.org/integration/mozilla-inbound/rev/ab309cc6297e
Part 5 - Use correct child event timestamps. r=dexter, f=chutten
https://hg.mozilla.org/integration/mozilla-inbound/rev/93016f1e2ca6
Part 6 - Update event documentation. r=dexter
Flags: needinfo?(gfritzsche)
You need to log in before you can comment on or make changes to this bug.