Add telemetry probe for synchronous scroll performance

RESOLVED FIXED in Firefox 45

Status

()

Toolkit
Telemetry
RESOLVED FIXED
2 years ago
a year ago

People

(Reporter: avih, Assigned: avih)

Tracking

unspecified
mozilla46
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox44 wontfix, firefox45 fixed, firefox46 fixed)

Details

Attachments

(3 attachments, 15 obsolete attachments)

6.99 KB, patch
avih
: review+
Details | Diff | Splinter Review
20.38 KB, patch
avih
: review+
Details | Diff | Splinter Review
6.97 KB, patch
avih
: review+
Details | Diff | Splinter Review
(Assignee)

Description

2 years ago
Scrolling is one of the prominent use cases where we want to provide as smooth as possible animation. We want to be able to monitor how good we're doing in practice.

For async scroll (using APZ) we already have a probe which estimates the our performance - CHECKERBOARDED_CSSPIXELS_MS which was added recently at bug 1221694.

For synchronous animations we have the probes FX_REFRESH_DRIVER_CHROME_FRAME_DELAY_MS and FX_REFRESH_DRIVER_CONTENT_FRAME_DELAY_MS which does record sync scroll performance, but they log all refresh driver activity and not only while scrolling.

This bug is about adding a probe which will reflect our synchronous scrolling performance.

To do that, we'd need:
1. Know when sync scroll starts and ends.
2. Collect refresh driver frame delay data while scrolling synchronously.

I've examined the 4 combinations of e10s and APZ (currently APZ is disabled when e10s is disabled regardless of the APZ pref value) and came up with the following code paths:

Summary:

- AsyncScroll lives as long as scrolling SYNCronously (yeah...) with KB or wheel, and not while scrolling asynchronously using APZ.

- AsyncSmoothMSDScroll lives as long as scrolling SYNCronously using smoothscroll CSS/DOM API, and not while scrolling asynchronously using APZ.

- Auto Scroll is implemented fully in JS (rAF) where its start/stop/cancel points are defined at toolkit/content/browser-content.js

- WheelScrollAnimation lives while mouse wheel APZ scroll (so we don't care about it here).

- AsyncScrollBase (which AsyncScroll and WheelScrollAnimation derive from) lives on both sync and async scrolls, triggered from KB or mouse wheel.
(Assignee)

Comment 1

2 years ago
Created attachment 8692949 [details] [diff] [review]
part 1: Add telemetry histogram gating support

This patch takes a simplistic approach. For now, I don't think we need more, and the API can probably stay the same even if we decide to change the implementation later.
Attachment #8692949 - Flags: review?(vladan.bugzilla)
Attachment #8692949 - Flags: feedback?(gfritzsche)
(Assignee)

Comment 2

2 years ago
Created attachment 8692952 [details] [diff] [review]
part 2: Add telemetry probe for synchronous scroll

The instrumentation locations were chosen to reflect the observations summarized at comment 0.
Attachment #8692952 - Flags: review?(bugmail.mozilla)
Attachment #8692952 - Flags: feedback?(vladan.bugzilla)
Comment on attachment 8692952 [details] [diff] [review]
part 2: Add telemetry probe for synchronous scroll

Review of attachment 8692952 [details] [diff] [review]:
-----------------------------------------------------------------

This looks OK to me, although we end up adding this in more places it would be nice to have a RAII helper for opening and closing the gate.
Attachment #8692952 - Flags: review?(bugmail.mozilla) → review+
Should we dupe bug 1198199?
Flags: needinfo?(avihpit)
(Assignee)

Updated

2 years ago
Duplicate of this bug: 1198199
(Assignee)

Comment 6

2 years ago
Yup.
(Assignee)

Updated

2 years ago
Flags: needinfo?(avihpit)
Comment on attachment 8692949 [details] [diff] [review]
part 1: Add telemetry histogram gating support

Review of attachment 8692949 [details] [diff] [review]:
-----------------------------------------------------------------

Did you talk to Georg about choosing an approach? Shouldn't the gate state be part of struct TelemetryHistogram?

::: toolkit/components/telemetry/Telemetry.cpp
@@ +3815,5 @@
> +//   on such case is very small in the grand scheme of things. So we ignore it.
> +//   If we wanted to handle it, we could probably either use PR_AtomicSet for
> +//   writes, or align the array to 64 bits. For now, it's probably not worth it.
> +//   Regardless, reads don't need any guards since the state is always consistent.
> +static bool gHistogramsClosedGates[HistogramCount] = {0};

shouldn't the gate state be part of struct TelemetryHistogram?

::: toolkit/components/telemetry/Telemetry.h
@@ +90,5 @@
> + *
> + * @param id - histogram id
> + * @param isOpen - new gate state: true to open the gate, false to close it
> + */
> +nsresult GateOpen(ID id, bool isOpen);

maybe call it SetHistogramGate?
Attachment #8692949 - Flags: review?(vladan.bugzilla)
Comment on attachment 8692952 [details] [diff] [review]
part 2: Add telemetry probe for synchronous scroll

Review of attachment 8692952 [details] [diff] [review]:
-----------------------------------------------------------------

the histogram declaration is good, someone familiar with refresh driver code should review the gate open/close code
Attachment #8692952 - Flags: feedback?(vladan.bugzilla)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #8)
> someone familiar with refresh driver code should review the gate open/close code

oops, kats already did
Comment on attachment 8692949 [details] [diff] [review]
part 1: Add telemetry histogram gating support

Review of attachment 8692949 [details] [diff] [review]:
-----------------------------------------------------------------

How can we add test-coverage for this? xpcshell?

::: toolkit/components/telemetry/Telemetry.cpp
@@ +3815,5 @@
> +//   on such case is very small in the grand scheme of things. So we ignore it.
> +//   If we wanted to handle it, we could probably either use PR_AtomicSet for
> +//   writes, or align the array to 64 bits. For now, it's probably not worth it.
> +//   Regardless, reads don't need any guards since the state is always consistent.
> +static bool gHistogramsClosedGates[HistogramCount] = {0};

As Vladan said, this should be part of TelemetryHistogram, unless there is a good reason against that.

@@ +3818,5 @@
> +//   Regardless, reads don't need any guards since the state is always consistent.
> +static bool gHistogramsClosedGates[HistogramCount] = {0};
> +
> +static inline bool
> +IsGateExists(ID aID)

This could just be a general IsValidHistogramId() (which could maybe be re-used across Telemetry.cpp).

You should move this into the top-level anonymous namespace, no need for explicit static here and below.

Side note:
Is inline still relevant?
I thought proper optimizing compilers today just do what they think is best either way.

@@ +3820,5 @@
> +
> +static inline bool
> +IsGateExists(ID aID)
> +{
> +  // ID is unsigned. If it weren't, we'd have needed to check >= 0 too.

Lets do a static assert on that instead of a comment?

@@ +3832,5 @@
> +  kGatesClosedMarker // flag as last element
> +};
> +
> +static void
> +GatesInit()

More descriptive as InitHistogramGates() ?

@@ +3842,5 @@
> +}
> +
> +// Returns true For invalid ID since default is open and it can't be closed.
> +static inline bool
> +IsGateOpen(ID aID)

IsHistogramGateOpen()?

@@ +3848,5 @@
> +  return !IsGateExists(aID) || !gHistogramsClosedGates[aID];
> +}
> +
> +// The external API for telemetry gating
> +nsresult

Do we get any value out of return a nsResult instead of just a bool?

@@ +3849,5 @@
> +}
> +
> +// The external API for telemetry gating
> +nsresult
> +GateOpen(ID aID, bool aIsOpen)

OpenHistogramGate()?

@@ +3852,5 @@
> +nsresult
> +GateOpen(ID aID, bool aIsOpen)
> +{
> +  if (!IsGateExists(aID)) {
> +    return NS_ERROR_ILLEGAL_VALUE;

Should we do a MOZ_ASSERT() here?
Or at least log a warning?

@@ +3971,5 @@
>    // Make the service manager hold a long-lived reference to the service
>    nsCOMPtr<nsITelemetry> telemetryService =
>      do_GetService("@mozilla.org/base/telemetry;1");
>    MOZ_ASSERT(telemetryService);
> +  GatesInit();

I think this belongs into TelemetryImpl::TelemetryImpl().

::: toolkit/components/telemetry/Telemetry.h
@@ +90,5 @@
> + *
> + * @param id - histogram id
> + * @param isOpen - new gate state: true to open the gate, false to close it
> + */
> +nsresult GateOpen(ID id, bool isOpen);

OpenHistogramGate()?
Do we gain anything from returning nsresult instead of just a bool?
Attachment #8692949 - Flags: feedback?(gfritzsche) → feedback+
Assignee: nobody → avihpit
Status: NEW → ASSIGNED
(Assignee)

Comment 11

2 years ago
(In reply to Georg Fritzsche [:gfritzsche] from comment #10)
> Is inline still relevant?

inline is a suggestion. The compiler can still handle it however it wants. I see no harm with it but I'm fine with removing it too, which I will do.


>> IsGateExists
> Lets do a static assert on that instead of a comment?

/me likes. Will do.


> no need for explicit static here and below.

static here only means the symbol is internal to this translation unit and shouldn't be visible outside of this file. I don't think the functions I marked as static should be visible elsewhere, and the only API I wanted to expose is Telemetry::GateOpen(...). Why don't you want them static?


> You should move this into the top-level anonymous namespace

>> GatesInit() 
> More descriptive as InitHistogramGates() ?

>> IsGateOpen(...)
> IsHistogramGateOpen()?

>> GateOpen(...)
> OpenHistogramGate()?

I chose to put it inside the Telemetry namespace since it works on telemetry histograms, which is also the reason for my decision to not include "Histogram" in the function names (it's implied by the namespace). Exactly the same reason that all the Accumulate functions are within this namespace and don't include "Histogram" as part of the function names. Contextually, it has the same context as Accumulate.

Since I don't understand why you want it, I'll ask the corollary: why do we want it at top level anonymous namespace, and why do we want a different namespace than that of the Accumulate functions?


>> nsresult GateOpen(...)
> Do we get any value out of return a nsResult instead of just a bool?

From the comment at the code:
>> Returns success iff id is of a known histogram (for unknown ones - cannot be closed).

So I think we do get a semantic value, since the function could not do what the caller asked it to. bool communicates a boolean answer to a question, where a success code indicates whether or not the function was able to complete the caller's request (possibly with the reason). So despite the fact that we only have two different return values, nsresults has better semantics here IMO.


>> nsresult GateOpen(...)
> Should we do a MOZ_ASSERT() here?
> Or at least log a warning?

Depends on the practice, and I'm unfamiliar with the practices of the telemetry code. I felt that a return value indicating whether or not the call succeeded would be enough, but I'm happy to use an assertion or a log warning. Just let me know which you prefer, if at all.

IMO, the options here WRT to nsresult/bool and checks are:
- Current code: communicate refusal via return value, caller may choose to handle it.
- Return void, but with runtime assertion/logging.
- All compile time: compile time assertion on the ID, no need for runtime checks (can return void). Though this means we can't call this function with a variable argument, so we'll need to modify GatesInit() too to work on the array directly - less encapsulation but not terrible.



>> static bool gHistogramsClosedGates[HistogramCount] = {0};
> As Vladan said, this should be part of TelemetryHistogram, unless there is a
> good reason against that.

I'm still trying to understand the implications of it.

The reason I chose this approach is that I wanted to avoid initialization cost, complexity and risk. If it's part of TelemetryHistogram then we probably need to modify TelemetryHistogramData.inc, which means we'll need to change the python code which generates it, which means we'll probably want to put the default gate state at histograms.json, which means we need to take care of possibly earlier state of the gate from previous sessions, which means much more complexity and risk.

I might be wrong though. I hope I'm wrong.

When I asked you on vidyo if a "less correct but simpler and safer approach would be acceptable", I meant exactly this part of the patch.
(In reply to Avi Halachmi (:avih) from comment #11)
> > no need for explicit static here and below.
> 
> static here only means the symbol is internal to this translation unit and
> shouldn't be visible outside of this file. I don't think the functions I
> marked as static should be visible elsewhere, and the only API I wanted to
> expose is Telemetry::GateOpen(...). Why don't you want them static?

Moving it into the anonymous namespace has the same effect and we already do this in Telemetry.cpp for internal functions.

> >> GatesInit() 
> > More descriptive as InitHistogramGates() ?
> 
> >> IsGateOpen(...)
> > IsHistogramGateOpen()?
> 
> >> GateOpen(...)
> > OpenHistogramGate()?
> 
> I chose to put it inside the Telemetry namespace since it works on telemetry
> histograms, which is also the reason for my decision to not include
> "Histogram" in the function names (it's implied by the namespace). Exactly
> the same reason that all the Accumulate functions are within this namespace
> and don't include "Histogram" as part of the function names. Contextually,
> it has the same context as Accumulate.

We have pretty different code in the Telemetry namespace, it doesn't exclusively deal with Histograms.
I think its good for readability, dealing with some kind of "Gate" is rather generic.

> Since I don't understand why you want it, I'll ask the corollary: why do we
> want it at top level anonymous namespace, and why do we want a different
> namespace than that of the Accumulate functions?

Telemetry::Accumulate() in turn is a public/interface function, it needs to be in that namespace.
When refactoring this file a bit we gathered the internal, non-public functions mostly in the anonymous namespace.
"static" vs. unnamed namespace is not a big deal, but i think we should stay with the same style.

> >> nsresult GateOpen(...)
> > Do we get any value out of return a nsResult instead of just a bool?
> 
> From the comment at the code:
> >> Returns success iff id is of a known histogram (for unknown ones - cannot be closed).
> 
> So I think we do get a semantic value, since the function could not do what
> the caller asked it to. bool communicates a boolean answer to a question,
> where a success code indicates whether or not the function was able to
> complete the caller's request (possibly with the reason). So despite the
> fact that we only have two different return values, nsresults has better
> semantics here IMO.

Current use only gives success/failure anyway, hence the question - but i don't care strongly.

> >> nsresult GateOpen(...)
> > Should we do a MOZ_ASSERT() here?
> > Or at least log a warning?
> 
> Depends on the practice, and I'm unfamiliar with the practices of the
> telemetry code. I felt that a return value indicating whether or not the
> call succeeded would be enough, but I'm happy to use an assertion or a log
> warning. Just let me know which you prefer, if at all.

As this is definitely an unexpected case i'd suggest:
* MOZ_ASSERT() for debug crash & discovery
* a warning after that for non-debug

> >> static bool gHistogramsClosedGates[HistogramCount] = {0};
> > As Vladan said, this should be part of TelemetryHistogram, unless there is a
> > good reason against that.
> 
> I'm still trying to understand the implications of it.
> 
> The reason I chose this approach is that I wanted to avoid initialization
> cost, complexity and risk. If it's part of TelemetryHistogram then we
> probably need to modify TelemetryHistogramData.inc, which means we'll need
> to change the python code which generates it, which means we'll probably
> want to put the default gate state at histograms.json, which means we need
> to take care of possibly earlier state of the gate from previous sessions,
> which means much more complexity and risk.
> 
> I might be wrong though. I hope I'm wrong.

It sounds like a lot, but its basically just affecting one line of Python and adding a line to the Histogram struct definition.
All other behavior should be basically the same, just different memory location and pattern.
Blocks: 1229104
(Assignee)

Comment 13

2 years ago
(In reply to Georg Fritzsche [:gfritzsche] from comment #12)
> > >> static bool gHistogramsClosedGates[HistogramCount] = {0};
> > > As Vladan said, this should be part of TelemetryHistogram, unless there is a
> > > good reason against that.
> > 
> > I'm still trying to understand the implications of it.
> > 
> > The reason I chose this approach is that I wanted to avoid initialization
> > cost, complexity and risk. If it's part of TelemetryHistogram then we
> > probably need to modify TelemetryHistogramData.inc, which means we'll need
> > to change the python code which generates it, which means we'll probably
> > want to put the default gate state at histograms.json, which means we need
> > to take care of possibly earlier state of the gate from previous sessions,
> > which means much more complexity and risk.
> > 
> > I might be wrong though. I hope I'm wrong.
> 
> It sounds like a lot, but its basically just affecting one line of Python
> and adding a line to the Histogram struct definition.
> All other behavior should be basically the same, just different memory
> location and pattern.

I still don't think I quite get it, despite you and vladan both suggesting apparently the same thing.

TelemetryHistogram only holds static properties of the histogram (name, number of buckets, etc). It should not be used to store the runtime state of the gate (gHistogramsClosedGates[] - to which this comment refers).

If we wanted to go "correct" all the way, the place to store the runtime state of the gate would be as part of the histogram itself. I.e. modify histogram.cc/h, add it as a member, add Histogram API to change it, and take it into account when updating it (all within class Histogram and possibly some derived classes too).

As far as TelemetryHistogram goes, the correct use for it would be to store the static property of whether or not the gate should be open by default for this histogram, which means replacing kGatesClosedByDefault[].

But when we add a property to TelemetryHistogram, we also have to align it with the static array TelemetryHistogram gHistogramp[] which is generated at build time from histograms.json into TelemetryHistogramData.inc.

We could "add one line" to always add 0 as this value, but then we would still need to keep the initial state someplace, so this doesn't make kGatesClosedByDefault[] go away.

Alternatively and more correctly, we would add the initial state into histograms.json, and update the python parser/generator to take it into account when generating gHistogramp[]. As far as I can tell, this is the only valid interpretation of "As Vladan said, this should be part of TelemetryHistogram".

Another possible interpretation of the request (by gfritzsche andvladan) is that we could reuse TelemetryHistogram to hold both the initial state (propagated from histograms.json) which would also later double as the runtime state for GateOpen(...) calls.

However, we cannot do that since TelemetryHistogram gHistograms[] is rightfully const, so it cannot and should not be used to hold the runtime state of the gates, but there's no other construct which holds TelemetryHistogram values as far as I can tell.

So bottom line, I don't know how to interpret the request of "this should be part of TelemetryHistogram".


Overall, we have two sets of values which we need to access:

1. The initial gate state which the patch puts at kGatesClosedByDefault[], where the only valid alternative is to put it at histograms.json and propagate it to TelemetryHistogram gHistograms[].

2. The runtime state of the gates which the patch puts at gHistogramsClosedGates[], where the only valid correct alternative is to add it as a member to class Histogram, with new APIs for it etc.

1 and 2 above are completely orthogonal, i.e. we could use one or the other or both or none.

I did consider each of them for the patch, and concluded that:
A. Both add zero functionality over the patch.
B. Both add code, scope and risk.
C. But - both are more correct from a system perspective.

I think I'm going to interpret it as 1 only. I.e. propagate the initial state from histogram.json into TelemetryHistogram gHistograms[] and use it instead of kGatesClosedByDefault[], but keep the runtime state at gHistogramsClosedGates[] rather than add gating support to class Histogram.

Please let me know if you want something else.
(Assignee)

Comment 14

2 years ago
Created attachment 8693912 [details] [diff] [review]
part 1 v2: Add telemetry histogram gating support

This addresses most of the comments and changes the following from v1:
- Names changes.
- Removed static and inline.
- Moved non-API things out of the Telemetry namespace.
- static_assert that ID is unsigned.
- Changed the API GateOpen to return void, and added MOZ_ASSERT (in line with other places at this file which don't add warnings/logs).
- Moved initial gate state to histograms.json (and with python to TelemetryHistogramData.inc and to TelemetryHistogram gHistograms[]) instead of at kGatesClosedByDefault[]. This also makes the initialization cleaner.

The changes do not include moving the runtime state of the gates to class Histogram.

After some more consideration, I think that this can still be considered "correct" since the histogram's goal is to collect samples, and the telemetry code controls when samples are allowed to be added to the histograms - the same as it does with CanRecord etc.
Attachment #8692949 - Attachment is obsolete: true
Attachment #8693912 - Flags: review?(gfritzsche)
Attachment #8693912 - Flags: feedback?(vladan.bugzilla)
(Assignee)

Comment 15

2 years ago
Created attachment 8693914 [details] [diff] [review]
part 2 v2: Add telemetry probe for synchronous scroll

Carrying r+ since this is minor.

The only change is that the initial gate state of FX_REFRESH_DRIVER_SYNC_SCROLL_FRAME_DELAY_MS is now at histograms.json.
Attachment #8692952 - Attachment is obsolete: true
Attachment #8693914 - Flags: review+
Attachment #8693914 - Flags: feedback?(vladan.bugzilla)
(Assignee)

Comment 16

2 years ago
Try push https://treeherder.mozilla.org/#/jobs?repo=try&revision=0f75f066dc88

Georg, you asked me to ni? you about something with the tests.
Flags: needinfo?(gfritzsche)
(Assignee)

Comment 17

2 years ago
Comment on attachment 8693912 [details] [diff] [review]
part 1 v2: Add telemetry histogram gating support

Review of attachment 8693912 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Telemetry.h
@@ +90,5 @@
> + *
> + * @param id - histogram id
> + * @param isOpen - new gate state: true to open the gate, false to close it
> + */
> +void GateOpen(ID id, bool isOpen);

I forgot to change this at v2. I'll change it to HistogramGateOpen (and at Telemetry.cpp and nsGfxScrollFrame.cpp).
(In reply to Avi Halachmi (:avih) from comment #16)
> Try push https://treeherder.mozilla.org/#/jobs?repo=try&revision=0f75f066dc88
> 
> Georg, you asked me to ni? you about something with the tests.

We can extend our xpcshell tests (test_nsITelemetry.js) with a test function that covers this if we expose the recording state to JS.

That means:
* Adding a function to set the recording state of a histogram to the JS interface.
  This will be similar to this:
  https://dxr.mozilla.org/mozilla-central/rev/a18630f9ab42ddfde03ba8c7757a42069c48c7ed/toolkit/components/telemetry/nsITelemetry.idl#303
  ... but without the [optional_argc] part etc.
* Changes to the interface mean you have to rev/change it's uuid:
  mach update-uuids nsITelemetry
* Implement the function similar to the above example (without the optArgCount):
  https://dxr.mozilla.org/mozilla-central/rev/a18630f9ab42ddfde03ba8c7757a42069c48c7ed/toolkit/components/telemetry/Telemetry.cpp#2167
  ... which would basically just call HistogramGateOpen().
* Add a test function to test_nsITelemetry.js that gives us coverage, by say:
  - testing both a default gated & non-gated histogram
  - .clear() the tested histograms at test start
  - flipping gating states around and then .add() to the histograms
  - check e.g. .snapshot().sum to confirm gating behaves as expected
Flags: needinfo?(gfritzsche)
Comment on attachment 8693912 [details] [diff] [review]
part 1 v2: Add telemetry histogram gating support

Review of attachment 8693912 [details] [diff] [review]:
-----------------------------------------------------------------

Two high-level points:
* Reading this today, i don't think the "gating" terminology is best in the public interface. Can we use "recording" instead?
  E.g. names like "initially_recording", "SetHistogramRecordingEnabled(id)".
* In my build, going by TelemetryHistogramEnums.h, the gHistograms table has 1615 entries.
  If we only have "very few" histograms that are closed by default, we should not add a flag for each of them.
  Instead we should only have data for the gating for the few non-default histograms - either constant data as your previous approach or generated from Histograms.json.
(Assignee)

Comment 20

2 years ago
(In reply to Georg Fritzsche [:gfritzsche] from comment #19)
> Two high-level points:
> * Reading this today, i don't think the "gating" terminology is best in the
> public interface. Can we use "recording" instead?
>   E.g. names like "initially_recording", "SetHistogramRecordingEnabled(id)".

We already have "recording" terminology, e.g. in CanRecord*, and gating is a well known terminology as a transient control where some parameter controls whether or not data passes in another channel - https://en.wikipedia.org/wiki/Gating_%28telecommunication%29

> * In my build, going by TelemetryHistogramEnums.h, the gHistograms table has
> 1615 entries.
>   If we only have "very few" histograms that are closed by default, we
> should not add a flag for each of them.
>   Instead we should only have data for the gating for the few non-default
> histograms - either constant data as your previous approach or generated
> from Histograms.json.

I don't feel too strongly either way, but this means that we're giving up the nice histograms.json interface in order for gHistograms to use less 1600 bools in memory. I'm sympathetic with the cause, but IMO some might argue that the nice histograms.json interface is worth more.

OTOH, while it is possible to make the python code generate another construct more similar to the minimal memory usage of the earlier kGatesClosedByDefault[], it's probably not worth the investment in this context.

I can live with either way on both issues.

Let vladan decide.
Flags: needinfo?(vladan.bugzilla)
(In reply to Avi Halachmi (:avih) from comment #20)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #19)
> > Two high-level points:
> > * Reading this today, i don't think the "gating" terminology is best in the
> > public interface. Can we use "recording" instead?
> >   E.g. names like "initially_recording", "SetHistogramRecordingEnabled(id)".
> 
> We already have "recording" terminology, e.g. in CanRecord*, and gating is a
> well known terminology as a transient control where some parameter controls
> whether or not data passes in another channel -
> https://en.wikipedia.org/wiki/Gating_%28telecommunication%29

"Gating" is definitely more specialized, "recording" should be immediately.
I didn't know what "gating" was supposed to mean here without reading more on the context.

> > * In my build, going by TelemetryHistogramEnums.h, the gHistograms table has
> > 1615 entries.
> >   If we only have "very few" histograms that are closed by default, we
> > should not add a flag for each of them.
> >   Instead we should only have data for the gating for the few non-default
> > histograms - either constant data as your previous approach or generated
> > from Histograms.json.
> 
> I don't feel too strongly either way, but this means that we're giving up
> the nice histograms.json interface in order for gHistograms to use less 1600
> bools in memory. I'm sympathetic with the cause, but IMO some might argue
> that the nice histograms.json interface is worth more.
> 
> OTOH, while it is possible to make the python code generate another
> construct more similar to the minimal memory usage of the earlier
> kGatesClosedByDefault[], it's probably not worth the investment in this
> context.

Yeah, we could do a "nice" Histograms.json property later, but all we are doing now is default-disabling one single probe.
If we need to configure more we could generate a separate header with the "histograms that have recording disabled by default" information later, matching what kGatesClosedByDefault does here.
Attachment #8693912 - Flags: review?(gfritzsche)
(Assignee)

Comment 22

2 years ago
Will do.
(Assignee)

Comment 23

2 years ago
Created attachment 8694765 [details] [diff] [review]
part 1 v3: Add telemetry histogram gating support

Compared to v2, this changes the following:
- Removes the initial state from histogram.json back to an array at Telemetry.cpp.
- Names changes as Georg suggested on IRC.
Attachment #8693912 - Attachment is obsolete: true
Attachment #8693912 - Flags: feedback?(vladan.bugzilla)
Flags: needinfo?(vladan.bugzilla)
Attachment #8694765 - Flags: review?(gfritzsche)
Attachment #8694765 - Flags: feedback?(vladan.bugzilla)
(Assignee)

Comment 24

2 years ago
Created attachment 8694766 [details] [diff] [review]
Part 2 v3: Add telemetry probe for synchronous scroll. r=kats

Carrying r+ since it's identical to v1 except for the names changes.
Attachment #8693914 - Attachment is obsolete: true
Attachment #8693914 - Flags: feedback?(vladan.bugzilla)
Attachment #8694766 - Flags: review+
(Assignee)

Comment 25

2 years ago
Created attachment 8694900 [details] [diff] [review]
Part 3: Telemetry recording enabled states: Use atomics

Following interesting discussions on IRC with botond, sfink and froynd, it was concluded that:

1. Due to potential simultaneous access from multiple threads, the access need to be guarded regardless of everything else.

2. Because we don't care about updates order or delayed observability, we can use Relaxed MemoryOrdering.

3. Initialization of the static states array is guaranteed to have all elements as false bool values.


This patch applies these understandings and makes the API and internal calls thread safe.
Attachment #8694900 - Flags: review?(nfroyd)
Attachment #8694900 - Flags: feedback?(gfritzsche)
Comment on attachment 8694900 [details] [diff] [review]
Part 3: Telemetry recording enabled states: Use atomics

Review of attachment 8694900 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Telemetry.cpp
@@ +914,5 @@
>  // - Potential memory saving: store each value as one bit.
> +// - Thread safety: MemoryOrdering of Relaxed was chosen since we care about
> +//   performance but don't care much about delayed observability of the changes
> +//   to the recording states. Also, no update depends on ordering of the
> +//   changes or on a prior values of any of the states.

Thank you for justifying the Relaxed ordering in a comment!
Attachment #8694900 - Flags: review?(nfroyd) → review+
(Assignee)

Comment 27

2 years ago
Created attachment 8695130 [details] [diff] [review]
Part 4: expose the API to js

So with this patch, I can now access this API from JS - tested in Firefox browser.js, where I controlled few different histograms and verified that their recording are indeed getting enabled/disabled (and their histograms accumulations reflect this at about:telemetry.

However, I just cannot get it to work from the xpcshell test.

The calls to enable/disable recording definitely arrive to the code in part 1 of the patch and end up changing the value at control array (the one which became atomic in part 3), however, none of the add() calls seem to go through the code which checks if the gate is open or close.

I tested this with count and with exponential histograms, and the add calls never get blocked by closed gate.

I hope to be able to continue debugging it before Orlando.
Attachment #8695130 - Flags: review?(gfritzsche)
Comment on attachment 8694765 [details] [diff] [review]
part 1 v3: Add telemetry histogram gating support

Review of attachment 8694765 [details] [diff] [review]:
-----------------------------------------------------------------

Are you confident that this covers all Accumulate()/Add() paths?
I'm not so sure about e.g. the JSKeyedHistogram_Add() -> KeyedHistogram::Add() path, i can check this in more detail later if needed.

::: toolkit/components/telemetry/Telemetry.cpp
@@ +897,5 @@
>  
>  bool
> +IsHistogramEnumId(Telemetry::ID aID)
> +{
> +  static_assert(((Telemetry::ID)-1 > 0), "ID is signed. Test also for aID >= 0 .");

static_assert(mozilla::IsUnsigned<decltype(aID)>::value, "ID should be unsigned.");
https://dxr.mozilla.org/mozilla-central/rev/eec9a0bfd929a68237f0ef799a9d8f28c4749296/mfbt/TypeTraits.h#537

@@ +910,5 @@
> +
> +// gHistogramsRecordingDisabled: true means disabled. The default is enabled.
> +// - Potential enhancement: change bool to a scalar type and allow nested
> +//   tracking as well by inc/dec the value. For now, we don't need it.
> +// - Potential memory saving: store each value as one bit.

Nit: Do we need to talk about possible future changes?
If we don't do them these parts of the comment just take time to read :)

@@ +926,5 @@
> +const Telemetry::ID kRecordingListDoneMarker = Telemetry::HistogramCount;
> +// List of histogram IDs which should have recording disabled initially.
> +const Telemetry::ID kRecordingInitiallyDisabledIDs[] = {
> +
> +  kRecordingListDoneMarker // marker as end of list

I don't think we need this marker.

@@ +934,5 @@
> +InitHistogramRecordingEnabled()
> +{
> +  Telemetry::ID id;
> +  for (size_t i = 0; (id = kRecordingInitiallyDisabledIDs[i]) != kRecordingListDoneMarker; i++) {
> +    SetHistogramRecordingEnabled(id, false);

Lets make this a bit more readable:

const size_t length = mozilla::ArrayLength(kRecordingInitiallyDisabledIDs);
for (size_t i = 0; i < length; ++i) {
  ...

@@ +938,5 @@
> +    SetHistogramRecordingEnabled(id, false);
> +  }
> +}
> +
> +// Returns true For invalid ID since default is enabled and it can't be disabled.

Nit: Lower-case "for".

@@ +3854,5 @@
> +void
> +SetHistogramRecordingEnabled(ID aID, bool aEnable)
> +{
> +  if (!IsHistogramEnumId(aID)) {
> +    MOZ_ASSERT(0, "Telemetry::SetHistogramRecordingEnabled(...) must be used with an enum id");

MOZ_ASSERT(false, ...)
Attachment #8694765 - Flags: review?(gfritzsche) → feedback+
Comment on attachment 8694900 [details] [diff] [review]
Part 3: Telemetry recording enabled states: Use atomics

Review of attachment 8694900 [details] [diff] [review]:
-----------------------------------------------------------------

Do we expect any significant performance problems here?
Given the context in bug 1141565 i assume no?
Attachment #8694900 - Flags: feedback?(gfritzsche) → feedback+
Comment on attachment 8695130 [details] [diff] [review]
Part 4: expose the API to js

Review of attachment 8695130 [details] [diff] [review]:
-----------------------------------------------------------------

Please also update the wiki:
https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Reference/Interface/nsITelemetry

::: toolkit/components/telemetry/Telemetry.cpp
@@ +2319,5 @@
>    return NS_OK;
>  }
>  
> +NS_IMETHODIMP
> +TelemetryImpl::SetHistogramRecordingEnabled(const nsACString &name, bool aEnable)

Nit: aEnabled

@@ +2324,5 @@
> +{
> +  Telemetry::ID id;
> +  nsresult rv = GetHistogramEnumId(PromiseFlatCString(name).get(), &id);
> +
> +  if (rv == NS_OK) {

if (NS_SUCCEEDED(rv)) {
 ...

@@ +2327,5 @@
> +
> +  if (rv == NS_OK) {
> +    Telemetry::SetHistogramRecordingEnabled(id, aEnable);
> +  } else if (rv == NS_ERROR_INVALID_ARG) {
> +    // We assert on invalid argument but not on other failures, e.g. too early call (NS_ERROR_FAILURE).

Nit: "arguments".

@@ +2328,5 @@
> +  if (rv == NS_OK) {
> +    Telemetry::SetHistogramRecordingEnabled(id, aEnable);
> +  } else if (rv == NS_ERROR_INVALID_ARG) {
> +    // We assert on invalid argument but not on other failures, e.g. too early call (NS_ERROR_FAILURE).
> +    MOZ_ASSERT(0, "SetHistogramRecordingEnabled(...) must be used with an enum name");

Nit:
MOZ_ASSERT(false, "SetHistogramRecordingEnabled() must be used with a known Histogram ID.");

::: toolkit/components/telemetry/nsITelemetry.idl
@@ +329,5 @@
>  
>    /**
> +   * Enable/disable recording for this histogram in runtime.
> +   * Recording is enabled by default, unless listed at kRecordingInitiallyDisabledIDs[].
> +   * name must be of a valid telemetry enum, or else will assert.

Nit:
Casing etc.: "Name must be a valid Histogram identifier, otherwise an assertion will be triggered."

@@ +334,5 @@
> +   *
> +   * @param name - histogram name
> +   * @param enabled - whether or not to enable recording from now on.
> +   */
> +  void setHistogramRecordingEnabled(in ACString name, in boolean enable);

Nit: "enabled".
Attachment #8695130 - Flags: review?(gfritzsche) → review+
(In reply to Georg Fritzsche [:gfritzsche] from comment #30)
> Please also update the wiki:
> https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Reference/
> Interface/nsITelemetry

We should also update the general "Adding a probe" article with info on the "recording enabled" state and the defaults:
https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Adding_a_new_Telemetry_probe
(Assignee)

Comment 32

2 years ago
So, here's the high level issue I have.

The current code at part 1 works very efficiently for Accumulate calls, but it works only for Accumulate calls. It's efficient since these calls have (or already calculate anyway) the histogram ID, and once we have the ID, it's a matter of only checking a value in an array.

This is enough for our current need for this bug (where the samples we need to "moderate" are collected using Accumulate).

However, if we want to test it from JS, then the current JS API is such that it doesn't contain the histogram ID when calling add(v). The ID is also not stored as part of the histogram itself.

So if we want to apply this gating system also to JS API, we'll need to get the histogram name (which the histogram does store), and then the ID from the name. This is O(1), but it's not negligible in terms of performance, and it's wasted for the vast majority of cases (where gating is not used at all).

So this is an approach I prefer not to take.

Another option is to create a new JS API, something along the lines of "add_gated" (we can fight over the name later), where this would be exactly what the current "add" does, but with the overhead mentioned earlier of checking whether the gate is open. So at least this overhead would be limited to places which actually need it.

Another path which I explored is adding this state to class Histogram itself. This way we don't need to maintain an array (of atomics), and we don't need the histogram ID in order to change the gate state. So it sounds great in theory.

The problem here is that it's not trivial and carries a non negligible risk both now and for the future. This is because class Histogram (and its derivatives) have quite a few APIs which can add samples, and we'd need to modify all of them (even if we only use a small subset of them, we'd still need to make it work consistently).

Some of these APIs call others APIs, some call others conditionally, some have side effects which need to be avoided (e.g. LinearHistogram updates its statistics and accumulates, so blocking only accumulate will result in incorrect statistics).

And it's also an issue for the future since many of the samples-adding APIs are virtual, and whoever adds future subclasses would need to handle the recording-enabled state explicitly as well.

This would be a very big overhead for our current needs and an unrequired risk and complexity IMO.

Overall, I tend to prefer the second approach where we add a JS API which takes recording-enabled state into account. This would be useful not only for testing but also for cases where JS needs gated samples, and it will not have any overhead as long as gating is not required.

Georg, what do you think?
Flags: needinfo?(gfritzsche)
(Assignee)

Comment 33

2 years ago
Created attachment 8695698 [details] [diff] [review]
part 1 v4 (inc. folded part3): Add telemetry histogram gating support

Carrying r+ since I've addressed all the concerns at comment 28. Feel free to r- if it needs further modifications.

Notes:

> Are you confident that this covers all Accumulate()/Add() paths?

I'm confident that it doesn't. This part covers the non-javascript access via Accumulate. JS exposure is added at the updated part 4 below.


> static_assert(mozilla::IsUnsigned<decltype(aID)>::value, "ID should be unsigned.");
> https://dxr.mozilla.org/mozilla-central/rev/eec9a0bfd929a68237f0ef799a9d8f28c4749296/mfbt/TypeTraits.h#537

The very line which you pointed at reads:
> IsUnsigned determines whether a type is an unsigned arithmetic type.

But Telemetry::ID is not an arithmetic type - It's an enum. As such, IsUnsigned fails to recognize that it's unsigned (and does recognize correctly, e.g. int or unsigned int).

I've left the original check but with an updated message. I also tested that it works correctly when ID is signed (i.e. it fails), and it does build and passes all tests on all platforms (try build which I posted here earlier).


> I don't think we need this marker.

We do if we want this part to compile on its own and not depend on later parts of the patch (else it's an empty array which fails to compile). I solved it by moving the test histogram at histograms.json to this part and using its ID as a single item of the array.
Attachment #8694765 - Attachment is obsolete: true
Attachment #8694765 - Flags: feedback?(vladan.bugzilla)
Attachment #8695698 - Flags: review+
(Assignee)

Comment 34

2 years ago
Created attachment 8695699 [details] [diff] [review]
Part 2 v4: Add telemetry probe for synchronous scroll

Carrying r+.
Attachment #8694766 - Attachment is obsolete: true
Attachment #8695699 - Flags: review+
(Assignee)

Comment 35

2 years ago
Created attachment 8695704 [details] [diff] [review]
part 4 v2: Expose RecordingEnabled to javascript

>> +    // We assert on invalid argument but not on other failures, e.g. too early call (NS_ERROR_FAILURE).

> Nit: "arguments".

No, we only assert on an invalid name (id) argument but not on other failures or other arguments.


This enhances the previous part 4, so I'm not carrying r+.

I took the approach I suggested at comment 32. I.e. I didn't want to hurt the performance of the vast majority of cases where don't use this feature (for now, the JS api is only used at the test in part 5), so I added another API add_gated (both for normal and keyed histograms) and the overhead applies only when using these.

I'd use any names which you'd want, but I'd appreciate if you could first review and comment on the approach, its validity or better alternatives, and leave the nits to last. Thanks.
Attachment #8694900 - Attachment is obsolete: true
Attachment #8695130 - Attachment is obsolete: true
Flags: needinfo?(gfritzsche)
Attachment #8695704 - Flags: review?(gfritzsche)
(Assignee)

Comment 36

2 years ago
Created attachment 8695705 [details] [diff] [review]
Part 5: Add tests
Attachment #8695705 - Flags: review?(gfritzsche)
(In reply to Avi Halachmi (:avih) from comment #32)
> Overall, I tend to prefer the second approach where we add a JS API which
> takes recording-enabled state into account. This would be useful not only
> for testing but also for cases where JS needs gated samples, and it will not
> have any overhead as long as gating is not required.
> 
> Georg, what do you think?

I don't think we should add special-cased methods here, that's a major pitfall for users of Telemetry. 
Let's investigate how we can solve that "efficiently enough", i'll try to get this part figured out as soon as i can.
(In reply to Avi Halachmi (:avih) from comment #33)
> > static_assert(mozilla::IsUnsigned<decltype(aID)>::value, "ID should be unsigned.");
> > https://dxr.mozilla.org/mozilla-central/rev/eec9a0bfd929a68237f0ef799a9d8f28c4749296/mfbt/TypeTraits.h#537
> 
> The very line which you pointed at reads:
> > IsUnsigned determines whether a type is an unsigned arithmetic type.
> 
> But Telemetry::ID is not an arithmetic type - It's an enum. As such,
> IsUnsigned fails to recognize that it's unsigned (and does recognize
> correctly, e.g. int or unsigned int).

Ok, and it looks like we are not able to rely on std::underlying_type<E> yet.
Comment on attachment 8695698 [details] [diff] [review]
part 1 v4 (inc. folded part3): Add telemetry histogram gating support

I only r+d part 4 so far.
Attachment #8695698 - Flags: review+ → review-
(Assignee)

Comment 40

2 years ago
Another thing we could do is to add a flag or value to class Histogram which is initialized to false by the histogram itself, but once initialized only the telemetry code uses it and updates it.

This way we can use this value every time we want to consider/update the gate state. It would be O(1) and also as efficient as it gets, and it will make the current patches much simpler, and it would not add burden to whoever touches the code of class histogram.

The downside is that class Histogram won't handle it by itself beyond init.

We could also consider it as context (and even use a real opaque void* pointer which is initialized to 0), so that would make it semantically acceptable as well.
(Assignee)

Comment 41

2 years ago
Created attachment 8695944 [details] [diff] [review]
bug1228147.alt1.commulative.patch

This is a proof of concept implementation for the idea at comment 40.

It's a cumulative patch with everything including the tests and the new probe.

It's very performant (not even hash access during normal use), and much less code than before, and with much less different code paths.

The only issue - the interface to class Histogram is not very elegant. But other than that, IMO it ticks all the boxes.
Comment on attachment 8695944 [details] [diff] [review]
bug1228147.alt1.commulative.patch

Review of attachment 8695944 [details] [diff] [review]:
-----------------------------------------------------------------

We sat down in Orlando and talked over the approaches.
Here are the high-level points, a detailed review is coming up later:
* I'm ok with the recording implementation for now, later i'd like it to be contained in Histogram.cc but that would require some cleanup work first.
* We should coordinate with jseward to not blow up his work in bug 1141565.
* We should have at least a proper API Histogram::SetRecordingEnabled()/GetRecordingEnabled().
Attachment #8695704 - Flags: review?(gfritzsche)
Attachment #8695705 - Flags: review?(gfritzsche)
Attachment #8695944 - Flags: review?(gfritzsche)
Attachment #8695698 - Attachment is obsolete: true
Attachment #8695699 - Attachment is obsolete: true
Attachment #8695704 - Attachment is obsolete: true
Attachment #8695705 - Attachment is obsolete: true
Comment on attachment 8695944 [details] [diff] [review]
bug1228147.alt1.commulative.patch

Review of attachment 8695944 [details] [diff] [review]:
-----------------------------------------------------------------

To re-summarize:
* I'm ok with the recording implementation for now, later i'd like it to be contained in Histogram.cc but that would require some cleanup work first.
* We should have at least a proper API Histogram::SetRecordingEnabled()/GetRecordingEnabled()
* Per comment 3 the graphics probe has r=kats
* I'll flag jseward on to avoid TSan issues here (see bug 1141565)

::: layout/generic/nsGfxScrollFrame.cpp
@@ +1509,5 @@
>      , mLastRefreshTime(aStartTime)
>      , mCallee(nullptr)
>      , mOneDevicePixelInAppUnits(aPresContext->DevPixelsToAppUnits(1))
>    {
> +    Telemetry::SetHistogramRecordingEnabled(Telemetry::FX_REFRESH_DRIVER_SYNC_SCROLL_FRAME_DELAY_MS, true);

The SetHistogramRecordingEnabled() lines in this file are pretty long, should we wrap them?

::: toolkit/components/telemetry/Histograms.json
@@ +4136,5 @@
> +    "kind": "exponential",
> +    "high": "10000",
> +    "n_buckets": 50,
> +    "bug_numbers": [1228147],
> +    "description": "Delay in ms between the target and the actual handling time of the frame at refresh driver while scrolling synchronously."

Add to the description that this will initially not be recording.

@@ +5060,5 @@
>    },
> +  "TELEMETRY_TEST_COUNT_INIT_NO_RECORD": {
> +    "expires_in_version": "never",
> +    "kind": "count",
> +    "description": "a testing histogram; not meant to be touched"

Please add to the description that this will initially not be recording.

::: toolkit/components/telemetry/Telemetry.cpp
@@ +840,5 @@
>    nsresult GetJSSnapshot(JSContext* cx, JS::Handle<JSObject*> obj,
>                           bool subsession, bool clearSubsession);
>  
> +  void SetRecordingEnabled(bool aEnabled) { mRecordingEnabled = aEnabled; };
> +  bool IsRecordingEnabled() { return mRecordingEnabled; };

bool IsRecordingEnabled() const { ...

@@ +905,5 @@
> +  static_assert(((Telemetry::ID)-1 > 0), "ID should be unsigned.");
> +  return aID < Telemetry::HistogramCount;
> +}
> +
> +// List of histogram IDs which should have recording disabled initially.

We can't put keyed histogram ids here, otherwise we get surprising behavior.
Lets either note this or support keyed histograms properly for default behavior.

@@ +923,5 @@
> +  }
> +}
> +
> +bool
> +IsHistogramRecordingEnabled(Histogram &h)

This should become Histogram::IsRecordingEnabled().

@@ +929,5 @@
> +  return !(h.flags() & Histogram::Flags::kCustomUserFlag);
> +}
> +
> +void
> +SetHistogramRecordingEnabled_internal(Histogram &h, bool aEnabled)

This should become Histogram::SetRecordingEnabled().

@@ +3870,5 @@
> +    MOZ_ASSERT(false, "Telemetry::SetHistogramRecordingEnabled(...) must be used with an enum id");
> +    return;
> +  }
> +
> +  Histogram *h;

You can branch by gHistograms[aID].keyed here.

::: toolkit/components/telemetry/Telemetry.h
@@ +83,5 @@
>   */
>  void AccumulateTimeDelta(ID id, TimeStamp start, TimeStamp end = TimeStamp::Now());
>  
>  /**
> + * Enable/disable recording for this histogram in runtime.

Nit: "at runtime"

@@ +85,5 @@
>  
>  /**
> + * Enable/disable recording for this histogram in runtime.
> + * Recording is enabled by default, unless listed at kRecordingInitiallyDisabledIDs[].
> + * id must be a valid telemetry enum, or else will assert.

"..., otherwise an assertion is triggered."?

::: toolkit/components/telemetry/nsITelemetry.idl
@@ +331,5 @@
> +   * Enable/disable recording for this histogram in runtime.
> +   * Recording is enabled by default, unless listed at kRecordingInitiallyDisabledIDs[].
> +   * Name must be a valid Histogram identifier, otherwise an assertion will be triggered.
> +   *
> +   * @param name - histogram name

Lets call the param "id" / "histogram id" for consistency with other functions.

::: toolkit/components/telemetry/tests/unit/test_nsITelemetry.js
@@ +419,5 @@
>    Assert.equal(orig.sum + 1, h.snapshot().sum,
>                 "Histogram value should have incremented by 1 due to recording.");
> +
> +
> +  // check that a "normal" histogram respects recording-enabled on/off

I think the recording state coverage should go into a separate test function.

@@ +669,5 @@
>    h.clear();
>    h.add(TEST_KEY, 1);
>    Assert.equal(h.snapshot(TEST_KEY).sum, 1);
> +
> +  // Check RecordingEnabled for keyed histograms

I think the recording state coverage should go into a separate test function.
Julian, if we are adding a flag or boolean to Histogram instances (in histogram.h/.cc) as in attachment 8695944 [details] [diff] [review] and control recording per Histogram based on that...
Will that cause new TSan issues? Should we change something here?
Flags: needinfo?(jseward)
Attachment #8695944 - Flags: review?(gfritzsche)
(Assignee)

Comment 45

2 years ago
Created attachment 8698501 [details] [diff] [review]
bug1228147.alt1.comulative.v2.patch

Carrying r+ from kats on the probe usage and from froydnj on the use of atomics.


Addressed the review comments with the following exceptions:

> Add to the description that this will initially not be recording.

As agreed on IRC, while I didn't mind adding it to the test histogram[s] descriptions (despite it already being part of their ID), it's irrelevant implementation detail for the "production" histogram, which is also mostly already implied by the description ("while scrolling...").


> We can't put keyed histogram ids here, otherwise we get surprising behavior.
> Lets either note this or support keyed histograms properly for default behavior.

The code already handled keyed histograms init the same as normal ones, hence the original comment is correct and no exception is required. However, I found a bug with keyed histograms, which I fixed and also added a test for it (see below).


> I think the recording state coverage should go into a separate test function.

Moved - all the RecordingEnabled tests for normal and keyed histograms to a single new test: test_histogram_recording_enabled


Other changes:
- Used atomic for both Histogram's and KeyedHistogram's recording state (same as patch v3 does).
- Added a new test keyed histogram: TELEMETRY_TEST_KEYED_COUNT_INIT_NO_RECORD
- Added tests for keyed histogram which has recording disabled by default.
- Fixed a bug where keyed histogram recording-disabled init failed - since it depends implicitly on sTelemetry already existing, but the previous init happened while instantiating it and therefore it wasn't yet available.

Considerations:
- Naming conventions and setter/getter body at histogram.h - according to local conventions.
- recording_enabled_ init at Initialize() rather than CTOR to avoid duplication (there are two CTORs).
- Calling InitHistogramRecordingEnabled() after instantiating sTelemetry rather than inside TelemetryImpl::TelemetryImpl since overall it's cleaner use of existing APIs (otherwise some APIs will need to be partially reimplemented privately).
Attachment #8695944 - Attachment is obsolete: true
Attachment #8698501 - Flags: review?(gfritzsche)
Comment on attachment 8698501 [details] [diff] [review]
bug1228147.alt1.comulative.v2.patch

Review of attachment 8698501 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with the below addressed.

::: ipc/chromium/src/base/histogram.cc
@@ +597,5 @@
>    size_t maximal_bucket_count = declared_max_ - declared_min_ + 2;
>    DCHECK_LE(bucket_count_, maximal_bucket_count);
>    DCHECK_EQ(0, ranges_[0]);
>    ranges_[bucket_count_] = kSampleType_MAX;
> +  recording_enabled_ = true;

This should be in the two constructors initialization list instead.

::: toolkit/components/telemetry/Telemetry.cpp
@@ +3866,5 @@
> +    if (keyed) {
> +      keyed->SetRecordingEnabled(aEnabled);
> +      return;
> +    }
> +

Nit: redundant empty line.

::: toolkit/components/telemetry/nsITelemetry.idl
@@ +331,5 @@
> +   * Enable/disable recording for this histogram at runtime.
> +   * Recording is enabled by default, unless listed at kRecordingInitiallyDisabledIDs[].
> +   * Name must be a valid Histogram identifier, otherwise an assertion will be triggered.
> +   *
> +   * @param id - unique identifier from TelemetryHistograms.h

Nit: "from Histograms.json"? There are no headers or enums in JS.

::: toolkit/components/telemetry/tests/unit/test_nsITelemetry.js
@@ +651,5 @@
> +  h.add(1);
> +  Assert.equal(orig.sum + 1, h.snapshot().sum,
> +              "add should record by default.");
> +
> +              // Check that when recording is disabled - add is ignored

Nit: Indentation is off.

@@ +676,5 @@
> +  h.add(1);
> +  Assert.equal(orig.sum + 1, h.snapshot().sum,
> +               "When recording is enabled add should record.");
> +
> +  // restore to disabled

Nit: "Restore"

@@ +682,5 @@
> +  h.add(1);
> +  Assert.equal(orig.sum + 1, h.snapshot().sum,
> +               "When recording is disabled add should not record.");
> +
> +  // Keyed Histograms

This doesn't depend on the above testing code really.
Let's move it to a separate function (say test_keyed_histogram_recording_enabled()) for easier reading & reasoning.
Attachment #8698501 - Flags: review?(gfritzsche) → review+
(In reply to Georg Fritzsche [:gfritzsche] [away Dec 19 - Jan 3] from comment #44)
> Julian, if we are adding a flag or boolean to Histogram instances (in
> histogram.h/.cc) as in attachment 8695944 [details] [diff] [review] and
> control recording per Histogram based on that...
> Will that cause new TSan issues? Should we change something here?

I think that either it won't cause new issues, or they will be relatively 
easy to fix.  What I propose is that this bug lands first, and then I redo
the Histogram de-racing bug (bug 1141565) on top of these changes.  This will
also make it easier to do TSan verification of the combined result.
Flags: needinfo?(jseward)
Blocks: 1141565
(Assignee)

Comment 48

2 years ago
Created attachment 8698992 [details] [diff] [review]
bug1228147.part1.v5.patch

Thanks for he reviews and patience.

Addressed all comments.

Carrying r+ from gfritzsche and froydnj.
Attachment #8698501 - Attachment is obsolete: true
Attachment #8698992 - Flags: review+
(Assignee)

Comment 49

2 years ago
Created attachment 8698993 [details] [diff] [review]
bug1228147.part2.v5.patch

Carrying r+ from kats.
Attachment #8698993 - Flags: review+

Comment 50

2 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/8a2f45c8d25d
https://hg.mozilla.org/integration/mozilla-inbound/rev/35b4ba91e093
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/ad01f1203ae4 for build bustage:

https://treeherder.mozilla.org/logviewer.html#?job_id=18653141&repo=mozilla-inbound
Flags: needinfo?(avihpit)
(Assignee)

Comment 52

2 years ago
Yeah, sorry, my fault.

I did have a green try push before fixing the nits, and I did build each part independently locally after handling the nits, but apparently incorrect position at the CTOR init list doesn't bother the windows build, but does bust others.

I'll fix and re-push.
Flags: needinfo?(avihpit)
(Assignee)

Comment 53

2 years ago
Created attachment 8699251 [details] [diff] [review]
bug1228147.part1.v6.patch

Carrying r+.

Moved recording_enabled_ to be the last member, which also makes it private.

(from few hours ago) https://treeherder.mozilla.org/#/jobs?repo=try&revision=cde9c70f40b0
Attachment #8698992 - Attachment is obsolete: true
Attachment #8699251 - Flags: review+

Comment 54

2 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/d32a676f1c39
https://hg.mozilla.org/integration/mozilla-inbound/rev/8f267491ef2a

Comment 55

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/d32a676f1c39
https://hg.mozilla.org/mozilla-central/rev/8f267491ef2a
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
status-firefox46: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
Avi: can you request uplift of these probes to Beta
(Assignee)

Comment 57

2 years ago
(thanks to jmaher with try pushes to beta/aurora).

Aurora try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=bec3956caa4b
Beta try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fabfc69b53aa

Aurora: the patches applied cleanly, with some offsets.

Beta: part 2 rejected but was trivial to fix (the earlier entry in histograms.json is the same on all trees but doesn't have the bug_number field in beta - and does have it on Aurora/Nightly).

Once the try pushes are green, I'll post the uplift requests.

Though maybe I'll wait a week or so to make sure nothing backfires with these patches on Nightly first.
(Assignee)

Comment 58

2 years ago
Both Beta/Aurora try pushes fail to build. As far as I can tell it's not due to the patch itself, but I can't really tell.

Apparently also we don't really have a system to test patches before uplift requests, I filed bug 1233740 for this.

I'll wait for mid next week to make sure everything is clean and working on nightly, and then request the uplift.

Meanwhile, I'll try to find out how to test the patch on Beta/Aurora too.
(Assignee)

Comment 59

a year ago
Created attachment 8701137 [details] [diff] [review]
bug1228147.part2-for-beta.patch

Identical changes to the earlier part2.v5, but with minor context difference to apply cleanly.
Attachment #8701137 - Flags: review+
(Assignee)

Comment 60

a year ago
Comment on attachment 8699251 [details] [diff] [review]
bug1228147.part1.v6.patch

Approval Request Comment

This patch (part 1) is meaningless without part 2.

[Feature/regressing bug #]:
Bug 1228147 (this bug).

[User impact if declined]:
None, but this patch will allow us to evaluate performance differences between e10s and no-e10s builds while scrolling, which is highly valuable data for us (Mozilla).

[Describe test coverage new/current, TreeHerder]:
Extensive tests exist at part 1 of this patch.

[Risks and why]:
Risk should be low. Both code complexity size, and also proven stability and effectiveness on m-c for some days.

[String/UUID change made/needed]:
UUID change at toolkit/components/telemetry/nsITelemetry.idl .

The UUID change is not mandatory for the patch to work, but it is mandatory for the tests to work. Also, reverting the UUID change only for beta would mean landing beta code which is not identical to the m-c code, which would increase some risks. But it's possible if we'll have to.
Attachment #8699251 - Flags: approval-mozilla-beta?
Attachment #8699251 - Flags: approval-mozilla-aurora?
(Assignee)

Comment 61

a year ago
Comment on attachment 8698993 [details] [diff] [review]
bug1228147.part2.v5.patch

Approval Request Comment

This patch (part 2) requires part 1.

[Feature/regressing bug #]:
Bug 1228147 (this bug).

[User impact if declined]:
None, but this patch will allow us to evaluate performance differences between e10s and no-e10s builds while scrolling, which is highly valuable data for us (Mozilla).

[Describe test coverage new/current, TreeHerder]:
Extensive tests exist at part 1 of this patch.

[Risks and why]:
Risk should be low. Both code complexity size, and also proven stability and effectiveness on m-c for some days.

[String/UUID change made/needed]:
None at this part.
Attachment #8698993 - Flags: approval-mozilla-aurora?
(Assignee)

Comment 62

a year ago
Comment on attachment 8701137 [details] [diff] [review]
bug1228147.part2-for-beta.patch

Approval Request Comment

This patch (part 2) requires part 1.

[Feature/regressing bug #]:
Bug 1228147 (this bug).

[User impact if declined]:
None, but this patch will allow us to evaluate performance differences between e10s and no-e10s builds while scrolling, which is highly valuable data for us (Mozilla).

[Describe test coverage new/current, TreeHerder]:
Extensive tests exist at part 1 of this patch.

[Risks and why]:
Risk should be low. Both code complexity size, and also proven stability and effectiveness on m-c for some days.

[String/UUID change made/needed]:
None at this part.
Attachment #8701137 - Flags: approval-mozilla-beta?
Vladan: I am concerned about possible impact of this change (without sufficient stabilization on Nightly/Aurora) on Beta44 performance. 

1. How often would this data be sent to UT server? I would expect Firefox end-users to use scrolling feature almost all the time. Could adding a telemetry probe to that oft-used action (and if probe is enabled by default) could possibly hurt the scrolling performance? 

2. I would like us to limit the # of telemetry probes we uplift in Beta cycle and the amount of code change here makes me uncomfortable.

Could we wait for this to ride the trains from Aurora45 -> Beta45?
Flags: needinfo?(vladan.bugzilla)
(In reply to Ritu Kothari (:ritu) from comment #63)
> Vladan: I am concerned about possible impact of this change (without
> sufficient stabilization on Nightly/Aurora) on Beta44 performance. 

This patch has been on Nightly since Dec 17. The changes unique to Beta are taking away functionality from the patch, they're not adding it.

> 1. How often would this data be sent to UT server? 

Firefox sends Telemetry data roughly once a day on a set schedule, sending is not related to how often we record Telemetry measurements in memory.

> I would expect Firefox end-users to use scrolling feature almost all the time. Could adding a
> telemetry probe to that oft-used action (and if probe is enabled by default)
> could possibly hurt the scrolling performance? 

Avi studied the overhead added by this probe and found it to be negligible.

> 2. I would like us to limit the # of telemetry probes we uplift in Beta
> cycle and the amount of code change here makes me uncomfortable.
> 
> Could we wait for this to ride the trains from Aurora45 -> Beta45?

We could wait, this uplift isn't critical, but we do need this data to make a recommendation on whether e10s is ready to ship to Release in Firefox 45. So in a way it's a tradeoff between Beta 44 stability vs Release 45 stability :)
Flags: needinfo?(vladan.bugzilla)
> This patch has been on Nightly since Dec 17. The changes unique to Beta are taking away
> functionality from the patch, they're not adding it.

Ignore this line, the patch nominated for uplift is actually the same for both Beta & Aurora.
I thought Avi had removed the UUID changes & tests from the beta version of the patch.
(In reply to Vladan Djeric (:vladan) -- please needinfo from comment #64)
> (In reply to Ritu Kothari (:ritu) from comment #63)
> > Vladan: I am concerned about possible impact of this change (without
> > sufficient stabilization on Nightly/Aurora) on Beta44 performance. 
> 
> This patch has been on Nightly since Dec 17. The changes unique to Beta are
> taking away functionality from the patch, they're not adding it.
> 
> > 1. How often would this data be sent to UT server? 
> 
> Firefox sends Telemetry data roughly once a day on a set schedule, sending
> is not related to how often we record Telemetry measurements in memory.
> 
> > I would expect Firefox end-users to use scrolling feature almost all the time. Could adding a
> > telemetry probe to that oft-used action (and if probe is enabled by default)
> > could possibly hurt the scrolling performance? 
> 
> Avi studied the overhead added by this probe and found it to be negligible.
> 
> > 2. I would like us to limit the # of telemetry probes we uplift in Beta
> > cycle and the amount of code change here makes me uncomfortable.
> > 
> > Could we wait for this to ride the trains from Aurora45 -> Beta45?
> 
> We could wait, this uplift isn't critical, but we do need this data to make
> a recommendation on whether e10s is ready to ship to Release in Firefox 45.
> So in a way it's a tradeoff between Beta 44 stability vs Release 45
> stability :)

Thanks Vladan! I absolutely believe these kinds of performance-measurement probes are useful. But I prefer uplifting it to Aurora45 for now and watch it for a few days. If all goes well and you are able to get valuable data (including good quality/useful data), we could re-think the need to land this in Beta. Hope that's a win-win for everyone.
Comment on attachment 8698993 [details] [diff] [review]
bug1228147.part2.v5.patch

Let's uplift to Aurora45 to stabilize.
Attachment #8698993 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Updated

a year ago
status-firefox45: --- → affected
Comment on attachment 8699251 [details] [diff] [review]
bug1228147.part1.v6.patch

Aurora45+
Attachment #8699251 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Comment 69

a year ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-aurora/rev/867787f8ae2a
https://hg.mozilla.org/releases/mozilla-aurora/rev/419239222c41
status-firefox45: affected → fixed

Updated

a year ago
status-firefox44: --- → affected
Comment on attachment 8699251 [details] [diff] [review]
bug1228147.part1.v6.patch

I thought about it some more and I really do not feel comfortable uplifting such a big patch to Beta at this stage. Please let this one ride the train from Aurora45 -> Beta45.
Attachment #8699251 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Comment on attachment 8701137 [details] [diff] [review]
bug1228147.part2-for-beta.patch

Please see my previous comment.
Attachment #8701137 - Flags: approval-mozilla-beta? → approval-mozilla-beta-

Updated

a year ago
status-firefox44: affected → wontfix
Hey Avi,

Can we pull results for this probe from our beta experiment?

https://wiki.mozilla.org/Electrolysis/Release_Criteria#Scrolling
Flags: needinfo?(avihpit)

Updated

a year ago
Blocks: 1255159
(Assignee)

Comment 73

a year ago
chutten took this. Thanks.
Flags: needinfo?(avihpit)
You need to log in before you can comment on or make changes to this bug.