Add telemetry for measuring the average load caused by timeouts

RESOLVED FIXED in Firefox 54

Status

()

Core
DOM
P1
normal
RESOLVED FIXED
2 months ago
a month ago

People

(Reporter: farre, Assigned: farre)

Tracking

unspecified
mozilla55
Points:
---

Firefox Tracking Flags

(firefox54 fixed, firefox55 fixed)

Details

Attachments

(2 attachments, 6 obsolete attachments)

(Assignee)

Description

2 months ago
We need a way to measure how effective throttling of timeouts in the background is. The least intrusive way of doing this is measuring the average time between timeouts.
(Assignee)

Updated

2 months ago
Assignee: nobody → afarre
(Assignee)

Updated

2 months ago
Summary: Add telemetry for measuring the average time between timeouts → Add telemetry for measuring the average load caused by timeouts
(Assignee)

Comment 1

2 months ago
Created attachment 8857530 [details] [diff] [review]
0001-Bug-1355480-Add-telemetry-measuring-the-time-between.patch

Comment 2

2 months ago
As I said on IRC, we need to measure also foreground tracking timers.
(Assignee)

Comment 3

2 months ago
Yep, and I also realize at least one obvious error. We need to compute an average for non tracking execution and one for tracking execution separately, since we can't "clear" the current average when one or the other occurs as we do when switching from foreground to background.
(Assignee)

Comment 4

2 months ago
Created attachment 8857972 [details] [diff] [review]
0001-Bug-1355480-Add-telemetry-measuring-timeout-throttli.patch

Version 2. Moved computing utilization to TimeoutManager::EndRunningTimeout, because that made more sense. Also switched to resetting sample data to a call from nsGlobalWindow::SetIsBackground that can keep track if backgroundedness is being toggled.
Attachment #8857530 - Attachment is obsolete: true
Attachment #8857972 - Flags: feedback?(wmccloskey)
Attachment #8857972 - Flags: feedback?(ehsan)
Attachment #8857972 - Flags: feedback?(bugs)
Comment on attachment 8857972 [details] [diff] [review]
0001-Bug-1355480-Add-telemetry-measuring-timeout-throttli.patch

Review of attachment 8857972 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, looks good overall, but please see the comments below.  I'm not sure if the handling of nested timeouts is correct here...

::: dom/base/TimeoutManager.cpp
@@ +139,5 @@
>      mRunningTimeout(nullptr),
>      mIdleCallbackTimeoutCounter(1),
>      mBackPressureDelayMS(0),
> +    mThrottleTrackingTimeouts(gTrackingTimeoutThrottlingDelay <= 0),
> +    mIdleStart(TimeStamp::Now()),

mStart and mLastIdlePeriod are left uninitialized...

@@ +1099,5 @@
>  
> +void
> +TimeoutManager::ResetTelemetry()
> +{
> +  mAverageIdlePeriod = 0;

Should this be 5?

@@ +1102,5 @@
> +{
> +  mAverageIdlePeriod = 0;
> +  mAverageExecutionPeriod = 0;
> +  mAverageTrackingExecutionPeriod = 0;
> +  mIdleStart = TimeStamp::Now();

Perhaps we should reset them here too?

@@ +1168,5 @@
> +    ? (mRunningTimeout->mIsTracking ? Telemetry::TIMEOUT_INTERVAL_FG_TRACKING_MS
> +       : Telemetry::TIMEOUT_INTERVAL_FG_MS)
> +    : (mRunningTimeout->mIsTracking ? Telemetry::TIMEOUT_INTERVAL_BG_TRACKING_MS
> +       : Telemetry::TIMEOUT_INTERVAL_BG_MS);
> +  Telemetry::Accumulate(id, GetUtilization(mRunningTimeout, now));

Does this correctly deal with nested running of timeouts (as in, your main thread starts to run a timeout, the timeout handler calls alert(), which spins a nested event loop, which re-enters BeginRunningTimeout(), etc.)  It doesn't appear so...

::: dom/base/nsGlobalWindow.cpp
@@ +10527,5 @@
>        inner->mTimeoutManager->ResetTimersForThrottleReduction();
>      }
> +
> +    if (didToggle) {
> +      inner->mTimeoutManager->ResetTelemetry();

The "background-ness" of a window also changes when it starts/stop playing audio, should we also do this when that happens?
Attachment #8857972 - Flags: feedback?(ehsan) → feedback+

Comment 6

2 months ago
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #5)
> ::: dom/base/TimeoutManager.cpp
> @@ +139,5 @@
> >      mRunningTimeout(nullptr),
> >      mIdleCallbackTimeoutCounter(1),
> >      mBackPressureDelayMS(0),
> > +    mThrottleTrackingTimeouts(gTrackingTimeoutThrottlingDelay <= 0),
> > +    mIdleStart(TimeStamp::Now()),
> 
> mStart and mLastIdlePeriod are left uninitialized...
One doesn't need to initialize TimeStamps and looks like mStart and mLastIdlePeriod aren't every used
without initialization.

> 
> @@ +1099,5 @@
> >  
> > +void
> > +TimeoutManager::ResetTelemetry()
> > +{
> > +  mAverageIdlePeriod = 0;
> 
> Should this be 5?
yeah, it should be something. Where is the 5 coming from? please use some macro for it or pref.

Comment 7

2 months ago
Comment on attachment 8857972 [details] [diff] [review]
0001-Bug-1355480-Add-telemetry-measuring-timeout-throttli.patch

>+TimeoutManager::GetUtilization(Timeout* aTimeout, TimeStamp aNow)
>+{
>+  TimeDuration executionDuration = aNow - mStart;
>+  // We compute an exponential weighted moving average over the time
>+  // span of the most recent 5 seconds. The sample period is taken to
>+  // be from the end of the last timeout to the end of the current.
>+  double decay = std::exp(- (aNow - mIdleStart) / TimeDuration::FromSeconds(5));
>+
>+  mAverageIdlePeriod =
>+    (1 - decay) * mLastIdlePeriod.ToMilliseconds() + decay * mAverageIdlePeriod;
>+
>+  double averageExecutionPeriod;
>+  if (aTimeout->mIsTracking) {
>+    averageExecutionPeriod = mAverageTrackingExecutionPeriod =
>+      (1 - decay) * executionDuration.ToMilliseconds() +
>+      decay * mAverageTrackingExecutionPeriod;
>+  } else {
>+    averageExecutionPeriod = mAverageExecutionPeriod =
>+      (1 - decay) * executionDuration.ToMilliseconds() +
>+      decay * mAverageExecutionPeriod;
>+  }
>+
>+  // Using the average idle period and the average execution periods
>+  // of tracking and non tracking timeouts, we can compute the average
>+  // time spent executing either a tracking or non-tracking timeout.
>+  double average = averageExecutionPeriod /
>+    (mAverageExecutionPeriod + mAverageTrackingExecutionPeriod +
>+     mAverageIdlePeriod);
Could you rename the variable 'average' to hint what average we're calculating here.

>@@ -1104,7 +1161,18 @@ TimeoutManager::EndRunningTimeout(Timeout* aTimeout)
>   --mTimeoutFiringDepth;
>   --gRunningTimeoutDepth;
> 
>+  TimeStamp now = TimeStamp::Now();
>+
>+  Telemetry::HistogramID id =
>+    !IsBackground()
>+    ? (mRunningTimeout->mIsTracking ? Telemetry::TIMEOUT_INTERVAL_FG_TRACKING_MS
>+       : Telemetry::TIMEOUT_INTERVAL_FG_MS)
>+    : (mRunningTimeout->mIsTracking ? Telemetry::TIMEOUT_INTERVAL_BG_TRACKING_MS
>+       : Telemetry::TIMEOUT_INTERVAL_BG_MS);
>+  Telemetry::Accumulate(id, GetUtilization(mRunningTimeout, now));
>+
>   mRunningTimeout = aTimeout;
>+  mIdleStart = now;
> }
As ehsan said, we do need something to cover alert() (maybe also sync XHR, since even though it is sync, it does spin event loop).
Perhaps nsIDocument:: SetIsInSyncOperation could tell the relevant TimeoutManager how much time was used in a "sync" operation.
Attachment #8857972 - Flags: feedback?(bugs) → feedback+
Comment on attachment 8857972 [details] [diff] [review]
0001-Bug-1355480-Add-telemetry-measuring-timeout-throttli.patch

Review of attachment 8857972 [details] [diff] [review]:
-----------------------------------------------------------------

I think I understand what number is being computed here, but I'm not convinced that that number is what we want. It's possible that it is, but I would need a lot more explanation to understand why.

One concern I have is that we're only accumulating into the histogram when a timeout runs. If we have 20 seconds of idle time, that will look the same as 5 seconds of idle time. Maybe that's not a very likely scenario, but I'm concerned that even on smaller timescales it doesn't really make sense to do it this way. If timeouts run very frequently, we'll get a lot more samples. I think that will bias things.

I also don't understand why it's okay to divide two moving averages. What sort of number do you get as the result? This isn't a concrete objection; it's just a feeling. 

Maybe I don't have enough math to understand this stuff, but I'd prefer if we could do something simpler that more people could understand. I'd like to propose a simpler alternative. What if we kept counters of the total time spent in the following:
  idle (no timeouts running), BG tracking timeout, FG tracking timeout, BG non-tracking timeout, FG non-tracking timeout
So, five counters. We would create a timer that would fire every minute. When this timer fires, we would figure out what % of the minute was spent in each type of timeout. Then we would add each of those utilizations to a separate histogram.

If the throttling of tracking timeouts works, then ideally we would see average and median values of the distribution for tracking timeouts go down while for the non-tracking timeouts the average would stay the same or maybe go up.

The reason I like this number is because it's basically what we were computing in bug 1296486, so it seems like a pretty natural measurement to me.
Attachment #8857972 - Flags: feedback?(wmccloskey) → feedback-
I agree with Bill.  I have to admit that I didn't quite understand the math either, but I was shy because I don't have very strong statistics so I thought perhaps this is obvious to everyone.  Apparently not.  :D

I think Bill's proposal is really nice.  I used the histograms in bug 1296486 in the initial measurements that lead to us implementing the tracking timeout throttling feature, they work really well to make it obvious where the problem areas are, and I found it pretty easy to reason about them.
(Assignee)

Comment 10

2 months ago
The problem that I see with a timer based approach is what to do with the counters we've collected after we've recorded a sample. We can't keep gathering counters and compute a total utilization since the creation of the window, since then we can't see if our throttling works because a long idle period followed by a short burst of activity won't be noticed. We can't just clear the counters between sample periods either, because then we could get a timer right in the middle of a burst, which will then only be recorded as being half its size. And it is here that a weighted moving average comes into play.

By computing a decaying value based on the time since we captured our previous sample we can compute how much the current sample is worth with regard to _all_ the previous, where every previous sample's contribution is getting smaller as time passes.

Also, counting at every timeout isn't actually that strange. What we do is divide each sample to consist of one execution period one idle period (the one before the execution period). This makes sense, since it isn't until we get a timeout that we have any utilization to take into consideration.

Wikipedia has the equations here: https://en.wikipedia.org/wiki/Moving_average#Application_to_measuring_computer_performance

We have the counters for fg-tracking, bg-tracking, fg-non-tracking and bg-non-tracking and idle in this patch, but we keep them as a sum of all their values, where every older term in the sum keeps getting less and less important. And if we have smaller periods closer together we will count the current and _all_ the rest as being more equal.

I tried adding a printf to the current patch, printing the counters at every timeout for https://fiddle.jshell.net/vvL0e9x3/?utm_source=website&utm_medium=embed&utm_campaign=vvL0e9x3

which is the demo mentioned in https://www.chromestatus.com/feature/6172836527865856

and I got the values that I expected. When the window is in the foreground the moving average of idle time is ~450 ms and the foreground moving average of execution time is 50 ms. When the window is in the background I get 950 ms and 50 ms respectively (which is our current throttling strategy where timeouts run every second).

In the end I think that collecting counters for the total won't work, but clearing counters after timers firing and collecting data will, but will be slightly wrong. And it will need timers.
(Assignee)

Comment 11

2 months ago
(In reply to Andreas Farre [:farre] from comment #10)
> We have the counters for fg-tracking, bg-tracking, fg-non-tracking and
> bg-non-tracking and idle in this patch

Ahh, this isn't actually true. We conflate fg-tracking with bg-tracking and fg-non-tracking with bg-non-tracking respectively. We should have a distinct counter for each.

But I still think need to keep a moving average.
(Assignee)

Comment 12

2 months ago
(In reply to Andreas Farre [:farre] from comment #11)
> 
> But I still think need to keep a moving average.

And there might be another bug with how the moving average are computed when their value is greater than the sample period.
(Assignee)

Comment 13

2 months ago
(In reply to Andreas Farre [:farre] from comment #12)
> (In reply to Andreas Farre [:farre] from comment #11)
> > 
> > But I still think need to keep a moving average.
> 
> And there might be another bug with how the moving average are computed when
> their value is greater than the sample period.

So we could clamp idle time to 5 - execution time, that way idle time never dominates the sample period but we punish longer execution periods. 

And I should probably say that the 5 is a chrome thing. Their budget appears to be 50ms, and 50ms/0.01 = 5000ms, so to track 1% with 50ms budget an 1/100 second per second regeneration we need 5 seconds sample period.
(Assignee)

Comment 14

2 months ago
Also, using some equalities:

average = sum / nvalues

exec_average / (idle_average + exec_average) = (exec_sum / nvalues) / ((idle_sum / nvalues) + (exec_sum / nvalues)) = (exec_sum / nvalues) / ((idle_sum + exec_sum) / nvalues) = (exec_sum * nvalues) / ((idle_sum + exec_sum) * nvalues) = exec_sum / (idle_sum + exec_sum)

Which is what we want to compute, so dividing averaged should be ok.
(In reply to Andreas Farre [:farre] from comment #10)
> The problem that I see with a timer based approach is what to do with the
> counters we've collected after we've recorded a sample. We can't keep
> gathering counters and compute a total utilization since the creation of the
> window, since then we can't see if our throttling works because a long idle
> period followed by a short burst of activity won't be noticed. We can't just
> clear the counters between sample periods either, because then we could get
> a timer right in the middle of a burst, which will then only be recorded as
> being half its size. And it is here that a weighted moving average comes
> into play.

I think that clearing the counters when a sampling period ends is the right thing to do. I understand what you're saying about a period ending in the middle of a burst of setTimeouts. But I don't think it's worth worrying about.

> Also, counting at every timeout isn't actually that strange. What we do is
> divide each sample to consist of one execution period one idle period (the
> one before the execution period). This makes sense, since it isn't until we
> get a timeout that we have any utilization to take into consideration.
> 
> Wikipedia has the equations here:
> https://en.wikipedia.org/wiki/
> Moving_average#Application_to_measuring_computer_performance

But how can we interpret the data? Normally a moving average is used to display an instantaneous measurement--not to be accumulated in a big histogram. The problem with your approach is that you get more data points when there are more timeouts. Suppose we had the following situation:
1. For the first five seconds, a tracking timeout runs for all five seconds. This gives you one data point with 100% utilization.
2. For the next five seconds, we're idle. We add a 0% data point.
3. For the next five seconds, we run a timeout for 1 microsecond every millisecond. This adds 1000 data points where the utilization is ~0%.

Things look really awesome if you look at the histogram. There's one bad sample, but tons of of 0% samples. But in practice, 1/3 of the time things were horrible and the other 2/3 of the time things were good. Which isn't very good at all.

(In reply to Andreas Farre [:farre] from comment #14)
> Also, using some equalities:
> 
> average = sum / nvalues
> 
> exec_average / (idle_average + exec_average) = (exec_sum / nvalues) /
> ((idle_sum / nvalues) + (exec_sum / nvalues)) = (exec_sum / nvalues) /
> ((idle_sum + exec_sum) / nvalues) = (exec_sum * nvalues) / ((idle_sum +
> exec_sum) * nvalues) = exec_sum / (idle_sum + exec_sum)
> 
> Which is what we want to compute, so dividing averaged should be ok.

This is true for a normal average, but you're using an exponentially decaying moving average. I don't think the math works out in that case. If you have two series, a_i and b_i, and you're dividing the moving averages (with decay parameter alpha), you get:

[ Sum[i=1..n] alpha^i * a_i ] / [ Sum[i=1..n] alpha^i * b_i ]

I don't see any way to factor that into something reasonable. You probably want something more like this:

Sum[i=1..n] alpha^i * (a_i / b_i)

Which you could compute directly. But as I said above, I don't think this is the right approach anyway.
Priority: -- → P1
(Assignee)

Comment 16

2 months ago
Created attachment 8860359 [details] [diff] [review]
0001-Bug-1355480-Add-telemetry-for-measuring-timeout-exec.patch

Right, totally different take on measuring timeout execution. As per Wednesdays meeting this adds probes for tracking, non-tracking foreground and background timeout execution time.

The sampling is driven from an interval timer, that only runs when there are timeouts. It currently is set to 1000ms.

Nested timeouts are handled by recording execution time in both BeginRunningTimeout and EndRunningTimeout, effectively "splitting" a timeout that spins an event loop into execution time before and after, but not inclusive.

The histograms are configured like this: https://telemetry.mozilla.org/histogram-simulator/#low=1&high=1000&n_buckets=20&kind=exponential&generate=log-normal
Attachment #8857972 - Attachment is obsolete: true
Attachment #8860359 - Flags: review?(wmccloskey)
Comment on attachment 8860359 [details] [diff] [review]
0001-Bug-1355480-Add-telemetry-for-measuring-timeout-exec.patch

Review of attachment 8860359 [details] [diff] [review]:
-----------------------------------------------------------------

This looks really nice. I think you still need to deal with nested event loops (so that we don't count time waiting for the user to respond to an alert()). Olli mentioned a way to be notified when we enter/exit one, so we could use that. I'll needinfo him about it since I don't know how to do it.

::: dom/base/TimeoutManager.cpp
@@ +19,5 @@
>  
>  static LazyLogModule gLog("Timeout");
>  
> +// Time between sampling timeout execution time.
> +const uint32_t kTelemetryPeriod = 1000;

The variable name should include units.

@@ +74,5 @@
> +TimeoutManagers::Remove(TimeoutManager* aManager)
> +{
> +  aManager->removeFrom(mTimeoutManagers);
> +
> +  if (IsQuiescent() && mTelemetryTimer) {

I'm worried about the cost of calling IsQuiescent here. I think we're iterating over all frames in the browser every time a frame goes away.

Can we just remove this block? The timer will go off at most once more and then it should stop, so it doesn't seem like a big deal.

@@ +99,5 @@
> +                                        nsITimer::TYPE_REPEATING_SLACK);
> +}
> +
> +void
> +TimeoutManagers::StartRecording() {

Brace goes on its own line.

@@ +125,5 @@
> +  }
> +
> +  // Set the new start time to now. In case of nested timeouts we'll
> +  // not double-record execution time this way.
> +  mStart = now;

It seems a little odd to do this here. I would rather that callers had to call StartRecording afterward to get this behavior. It seems more explicit.

@@ +164,5 @@
> +  mTelemetryData = TelemetryData();
> +}
> +
> +bool
> +TimeoutManagers::IsQuiescent()

I'm worried that this basically has to iterate over all the frames in your browser. Could we instead keep a linked list only of managers that have been active recently? We would add a manager to the list when setTimeout is called. Then, inside the CollectTelemetry callback, we would remove entries from the list that don't have any timers. If the list ends up empty, then we could cancel the CollectTelemetry timer.

@@ +1282,4 @@
>    --mTimeoutFiringDepth;
>    --gRunningTimeoutDepth;
>  
> +  TimeoutManagers::Get().RecordExecution(mRunningTimeout, IsBackground());

Don't you potentially need to call StartRecording again if this was a nested timeout that ran? I guess we would do that if aTimeout is non-null.

::: dom/base/nsGlobalWindow.cpp
@@ +10521,4 @@
>      if (resetTimers) {
>        inner->mTimeoutManager->ResetTimersForThrottleReduction();
>      }
> +

Oops.

::: modules/libpref/init/all.js
@@ -1,1 @@
> -

Oops.
Attachment #8860359 - Flags: review?(wmccloskey) → feedback+
Olli, you mentioned a way to be notified when we enter/exit a nested event loop. How do you do that?
Flags: needinfo?(bugs)

Comment 19

2 months ago
The web phasing APIs which may trigger nested event loop handling should use nsAutoSyncOperation, things like sync XHR and alert().
Flags: needinfo?(bugs)
(Assignee)

Comment 20

2 months ago
Created attachment 8861303 [details] [diff] [review]
0001-Bug-1355480-Add-telemetry-for-measuring-timeout-exec.patch

Made sure to account for sync operations using nsAutoSyncOperation. Did it similarly to what BeginRunningTimeout/EndRunningTimeout do, record the fragment up until that a sync starts, and continue after a sync ends. One difference is that the telemetry timer needs to restart after a sync.

Also, Bill, I think that I figured out how to make the telemetry timer run only when we've had activity without keeping a list of TimeoutManagers. Since we're on a timer that dispatches to the main thread as the timeouts do, then we can just check when the telemetry timer fires to see if we've recorded any counters, and if we haven't don't restart it. This requires that we count timeout managers instead to know if to destroy the timer if a TimeoutManager is destroyed.
Attachment #8860359 - Attachment is obsolete: true
Flags: needinfo?(wmccloskey)
Attachment #8861303 - Flags: review?(bugs)
(Assignee)

Comment 21

2 months ago
I'll add the telemetry review if we think that this is good enough.

Comment 22

2 months ago
I am very worried about adding yet another almost all the time running timer. That will drain battery by causing wakeups even more. We're already quite bad with wakeups.

Comment 23

2 months ago
Comment on attachment 8861303 [details] [diff] [review]
0001-Bug-1355480-Add-telemetry-for-measuring-timeout-exec.patch

I think we want this only for non-chrome.

or if we need telemetry for chrome, that should be separate probes.
Attachment #8861303 - Flags: review?(bugs)
(Assignee)

Comment 24

2 months ago
Created attachment 8861916 [details] [diff] [review]
0001-Bug-1355480-Add-telemetry-for-measuring-timeout-exec.patch

Made telemetry collection timer-less again. Now collection happens when we start to run a new timeout and it was over 1000ms since we collected telemetry.

This might make us loose up to 1000ms of data if timeouts run for less than 1000ms and thereafter no timeouts at all. This is quite unlikely though, since data collection is per process and any TimeoutManager on that process helps drive the collection.
Attachment #8861303 - Attachment is obsolete: true
Flags: needinfo?(bugs)

Comment 25

2 months ago
Looks reasonable. And if I read right, BeginSyncOperation/EndSyncOperation handling should work also
if some other tabs runs timeouts. We just start/stop counting there.
Flags: needinfo?(bugs)
(Assignee)

Updated

2 months ago
Attachment #8861916 - Flags: feedback?(benjamin)

Updated

2 months ago
Attachment #8861916 - Flags: review+
(Assignee)

Comment 26

2 months ago
Created attachment 8861941 [details] [diff] [review]
0001-Bug-1355480-Add-telemetry-for-measuring-timeout-exec.patch

Fixed nit discussed about on IRC.
Attachment #8861916 - Attachment is obsolete: true
Attachment #8861916 - Flags: feedback?(benjamin)
Attachment #8861941 - Flags: review+
Attachment #8861941 - Flags: feedback?(benjamin)
Yeah, this seems better I think.
Flags: needinfo?(wmccloskey)
Comment on attachment 8861941 [details] [diff] [review]
0001-Bug-1355480-Add-telemetry-for-measuring-timeout-exec.patch

Two things to fix:
* please make these expire in 61 (six months/five releases from now). You can renew if the data is still useful/being used.
* Talked with afarre about this over IRC, and I suggesting changing the phrasing of "Time is measured over a sampling period of 1s." because this isn't actually sampling. Rather I suggest something like: "Multiple events are aggregated over a 1s interval."

data-r=me with those changes
Attachment #8861941 - Flags: feedback?(benjamin) → feedback+
(Assignee)

Comment 29

2 months ago
Created attachment 8863755 [details] [diff] [review]
0001-Bug-1355480-Add-telemetry-for-measuring-timeout-exec.patch

Carrying over r+ from :smaug, data-r+ from :bsmedberg and try looks good https://treeherder.mozilla.org/#/jobs?repo=try&revision=80840c059ff1c53f32b634b3e2e75a720514b48e
Attachment #8861941 - Attachment is obsolete: true
Attachment #8863755 - Flags: review+
(Assignee)

Comment 30

2 months ago
Setting checkin-needed, still having issues with git cinnabar and push. The subject line looks a bit wonky, might keep an eye when applying. Thanks.
Keywords: checkin-needed

Comment 31

2 months ago
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/69439c0cd3e6
Add telemetry for measuring timeout execution. r=smaug, data-r=bsmedberg
Keywords: checkin-needed
Backed out bug 1355480 and bug 833143 on suspicion of triggering assertion CompositorThreadHolder::IsInCompositorThread() on Windows 8 x64 M-e10s:

Bug 1355480
https://hg.mozilla.org/integration/mozilla-inbound/rev/1e002af2f7019f8b45a05722efb882043962ae75

Bug 833143
https://hg.mozilla.org/integration/mozilla-inbound/rev/eb15e1b1baee587c057ecc907759e223bf421315

Push with failure: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=5e01d4124eaefeb0702f31689a9223f14351368e&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=96034710&repo=mozilla-inbound

13:02:10     INFO - GECKO(1800) | Assertion failure: CompositorThreadHolder::IsInCompositorThread(), at c:/builds/moz2_slave/m-in-w64-d-0000000000000000000/build/src/gfx/layers/AnimationHelper.cpp:28
13:02:10     INFO - GECKO(1800) | #01: mozilla::layers::CompositorAnimationStorage::Release() [obj-firefox/dist/include/mozilla/layers/AnimationHelper.h:83]
13:02:10     INFO - 
13:02:10     INFO - GECKO(1800) | #02: mozilla::layers::CompositorBridgeParent::ActorDestroy(mozilla::ipc::IProtocol::ActorDestroyReason) [gfx/layers/ipc/CompositorBridgeParent.cpp:644]
13:02:10     INFO - 
13:02:10     INFO - GECKO(1800) | #03: mozilla::layers::PCompositorBridgeParent::OnChannelClose() [obj-firefox/ipc/ipdl/PCompositorBridgeParent.cpp:1808]
13:02:10     INFO - 
13:02:10     INFO - GECKO(1800) | #04: mozilla::ipc::MessageChannel::NotifyChannelClosed() [ipc/glue/MessageChannel.cpp:2651]
13:02:10     INFO - 
13:02:10     INFO - GECKO(1800) | #05: mozilla::ipc::MessageChannel::OnNotifyMaybeChannelError() [ipc/glue/MessageChannel.cpp:2512]
13:02:10     INFO - 
13:02:10     INFO - GECKO(1800) | #06: mozilla::detail::RunnableMethodImpl<mozilla::ipc::MessageChannel * const,void ( mozilla::ipc::MessageChannel::*)(void),0,1>::Run() [xpcom/threads/nsThreadUtils.h:911]
13:02:10     INFO - 
13:02:10     INFO - GECKO(1800) | #07: MessageLoop::RunTask(already_AddRefed<mozilla::Runnable>) [ipc/chromium/src/base/message_loop.cc:362]
13:02:10     INFO - 
13:02:10     INFO - GECKO(1800) | #08: MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask &&) [ipc/chromium/src/base/message_loop.cc:372]
13:02:10     INFO - 
13:02:10     INFO - GECKO(1800) | #09: MessageLoop::DoWork() [ipc/chromium/src/base/message_loop.cc:444]
13:02:10     INFO - 
13:02:10     INFO - GECKO(1800) | #10: base::MessagePumpForUI::DoRunLoop() [ipc/chromium/src/base/message_pump_win.cc:213]
13:02:10     INFO - 
13:02:10     INFO - GECKO(1800) | #11: base::MessagePumpWin::RunWithDispatcher(base::MessagePump::Delegate *,base::MessagePumpWin::Dispatcher *) [ipc/chromium/src/base/message_pump_win.cc:58]
13:02:10     INFO - 
13:02:10     INFO - GECKO(1800) | #12: MessageLoop::RunHandler() [ipc/chromium/src/base/message_loop.cc:232]
13:02:10     INFO - 
13:02:10     INFO - GECKO(1800) | #13: MessageLoop::Run() [ipc/chromium/src/base/message_loop.cc:212]
13:02:10     INFO - 
13:02:10     INFO - GECKO(1800) | #14: base::Thread::ThreadMain() [ipc/chromium/src/base/thread.cc:182]
13:02:10     INFO - 
13:02:10     INFO - GECKO(1800) | #15: `anonymous namespace'::ThreadFunc [ipc/chromium/src/base/platform_thread_win.cc:29]
13:02:10     INFO - 
13:02:10     INFO - GECKO(1800) | #16: KERNEL32.DLL + 0x167e
13:02:10     INFO - 
13:02:10     INFO - GECKO(1800) | #17: ntdll.dll + 0x1c3f1
Flags: needinfo?(afarre)
pchang: smaug recommended you might have insight what could be the issue here. Any ideas? Thank you.
Flags: needinfo?(howareyou322)

Comment 34

2 months ago
It looks like we might get the nullptr about CompositorThread() in [1], instead of we are running on non-compositor thread.
Then we hit the assertion from comment 32.

Does this always happen because my patches were landed for a while? If it's intermittent failures, I can create a follow-up bug to fix it.

[1]http://searchfox.org/mozilla-central/source/gfx/layers/ipc/CompositorThread.cpp#145
Flags: needinfo?(howareyou322) → needinfo?(aryx.bugmail)
(Assignee)

Comment 35

2 months ago
I added the failing tests to the try run:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=80840c059ff1c53f32b634b3e2e75a720514b48e

and they don't seem to be a problem for this patch. Should we try checking in this patch again?
Flags: needinfo?(afarre)

Comment 36

2 months ago
After checking the push log, I found bug 833143 changed the shutdown sequence and might hit the problem I mentioned in comment 34. I just submitted a try to prove it.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b5f96327d1c63f0f4d72542d2c1be24979f9aa46

I also agree that the patch in this bug shouldn't be related to this try failures.
(Assignee)

Updated

2 months ago
Keywords: checkin-needed

Comment 37

2 months ago
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b55f4a13a1aa
Add telemetry for measuring timeout execution. r=smaug, data-r=bsmedberg
Keywords: checkin-needed
The problem which required the backout is being tracked in bug 1361825 now.
Flags: needinfo?(aryx.bugmail)

Comment 39

2 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/b55f4a13a1aa
Status: NEW → RESOLVED
Last Resolved: 2 months ago
status-firefox55: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
(Assignee)

Comment 40

2 months ago
Created attachment 8865792 [details] [diff] [review]
0001-Bug-1355480-Add-telemetry-for-measuring-beta-version.patch

Approval Request Comment
[Feature]:
Add telemetry for measuring the average load caused by timeouts.
[User impact if declined]:
We want to get a head start with collecting telemetry to better track the effect of Bug 1355311, Bug 1362322, and Bug 1357842. Without this patch the data on which we can decide the effectiveness of throttling will not be as decisive.
[Is this code covered by automated tests?]:
No, but this code doesn't actually implement a feature or fix a bug.
[Has the fix been verified in Nightly?]:
Yes.
[Needs manual test from QE? If yes, steps to reproduce]: 
No.
[List of other uplifts needed for the feature/fix]:
None.
[Is the change risky?]:
No.
[Why is the change risky/not risky?]:
This change only adds telemetry collection. The beta version of the patch is only for conflict resolution in Histogram.json
[String changes made/needed]:
None.
Attachment #8865792 - Flags: approval-mozilla-beta?

Updated

a month ago
status-firefox54: --- → affected
Hi :farre,
May I know if the telemetry will have any performance impact? It seems we add a lot of recording.
(Assignee)

Comment 42

a month ago
The way telemetry ended up being collected in this patch it is neither driven by its own timer nor necessarily being collected on every setTimeout/setInterval since we average it over 1s. That is, we call Telemetry::Accumulate at most once per second per content process, regardless of if we've had 50 setTimeout callbacks firing or 1, or 10 tabs on the same process or just 1. 

So the question is if at most once every second is alot, I guess. I feel I lack the experience to give a solid answer here so I'll bring in :ehsan on this.
Flags: needinfo?(ehsan)
Just to clarify, this does not run off a timer or something like that, but only when we run a timeout handler and reading the code I doubt the cost of running the additional code to collect the telemetry is that high, and I have never seen it show up in a profile.  I don't think this is something to be concerned about before we have any evidence to the contrary!
Flags: needinfo?(ehsan)
Comment on attachment 8865792 [details] [diff] [review]
0001-Bug-1355480-Add-telemetry-for-measuring-beta-version.patch

The patch help collect telemetry for measuring the average load caused by timeouts. Beta54+. Should be in 54 beta 8.
Attachment #8865792 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment 45

a month ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-beta/rev/b1dd6f4104bc
status-firefox54: affected → fixed
You need to log in before you can comment on or make changes to this bug.