Closed Bug 1182637 Opened 9 years ago Closed 8 years ago

[meta][e10s][telemetry] e10s is jankier than non-e10s ('gecko_hangs_per_minute', Telemetry experiments)

Categories

(Firefox :: General, defect, P1)

41 Branch
defect

Tracking

()

RESOLVED INVALID
Tracking Status
e10s + ---

People

(Reporter: vladan, Assigned: gkrizsanits)

References

(Blocks 1 open bug)

Details

(Whiteboard: [webvr])

According to Telemetry, the frequency of jank events lasting more than 100ms increased from ~19 events/min to ~21 events/minute with e10s. This was derived from the main-thread's event processing times and session uptime:
"gecko_hangs_per_minute" in http://nbviewer.ipython.org/urls/gist.githubusercontent.com/vitillo/cb6f1304316c1c1a2cbc/raw/e10s%20analysis.ipynb

^ Based on Telemetry collected from Nightly 41 on June 15th, from buildIDs in the range [20150601, 20150616]

Full e10s performance write-up: https://groups.google.com/d/msg/mozilla.dev.platform/elb0Au7qwG0/Yarsx1xrrnwJ
tracking-e10s: --- → ?
Vladan, one theory is that the cause is addons over-using CPOWs. Is there anyway you can either prove or disprove that theory with data?
Flags: needinfo?(vdjeric)
(In reply to Brad Lassey [:blassey] (use needinfo?) from comment #1)
> Vladan, one theory is that the cause is addons over-using CPOWs. Is there
> anyway you can either prove or disprove that theory with data?

Yes, we could re-run the analysis on reports from profiles with no addons. Roberto, would you have time to do this sometime in the next couple of weeks?
Flags: needinfo?(vdjeric) → needinfo?(rvitillo)
we also have the MISBEHAVING_ADDONS_CPOW_TIME_MS, MISBEHAVING_ADDONS_JANK_LEVEL and ADDON_SHIM_USAGE histograms which we could use as controls
We don't have enough users not using e10s without add-ons to make any strong claims. Soon enough we are probably not going to have enough users to compare e10s and non e10s builds as well.

We should make sure that at least one quarter of Nightly users still runs non-e10s builds.
Flags: needinfo?(rvitillo)
Anthony: Can you re-run Roberto's gecko_hangs_per_minute comparison of e10s vs non-e10s responsiveness, on Aurora 41 using buildIDs from before July 29th, and profiles with no extensions except the default Firefox theme extension? 

Roberto's analysis on Nightly: http://nbviewer.ipython.org/urls/gist.githubusercontent.com/vitillo/cb6f1304316c1c1a2cbc/raw/e10s%20analysis.ipynb
Flags: needinfo?(azhang)
Done: https://gist.github.com/Uberi/40458dbccd88388739f7

This one is based on rvitillo's no addons e10s analysis [1]. It uses data from Aurora 41 users with buildIDs from July 28th, 2015, who have no extensions installed (except the default theme extension).

[1]: https://gist.github.com/vitillo/8b6af93c6741e2278807
Flags: needinfo?(azhang)
Anthony, why is the "gecko_hangs_per_minute" comparison missing from your notebook?
Flags: needinfo?(azhang)
Missed that while checking over the notebook - the link above is updated now with all the info.
Flags: needinfo?(azhang)
Summary: Telemetry: e10s is jankier than non-e10s → [e10s][telemetry] e10s is jankier than non-e10s
This data comes from here - 
http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/BackgroundHangMonitor.cpp

the current aurora experiment data is here - 
https://github.com/vitillo/e10s_analyses/blob/master/aurora/e10s_experiment.ipynb

search for 'gecko_hangs_per_minute'. This includes a random mixtures of users with addons installed. 

Results:

Median difference in hangs over 100ms per minute is 8.90, (31.18, 22.28).
The probability of this effect being purely by chance is 0.00.

This is pretty bad. Why the chrome thread would see this increase in jank is surprising considering how well we're doing on other performance stats. For starters we might want to look into the background thread measuring code to be sure it's working right with e10s.
Summary: [e10s][telemetry] e10s is jankier than non-e10s → [e10s][telemetry] e10s is jankier than non-e10s (aurora experiment w/addons)
for clients without any extensions:
https://github.com/vitillo/e10s_analyses/blob/master/aurora/addons/e10s_noext_experiment.ipynb

for clients with at least one extension:
https://github.com/vitillo/e10s_analyses/blob/master/aurora/addons/e10s_onlyext_experiment.ipynb

for clients with just adblock plus installed:
https://github.com/vitillo/e10s_analyses/blob/master/aurora/addons/e10s_top_extension_experiment.ipynb
Summary: [e10s][telemetry] e10s is jankier than non-e10s (aurora experiment w/addons) → [e10s][telemetry] e10s is jankier than non-e10s ('gecko_hangs_per_minute', aurora experiment)
Blocks: 1224276
No longer blocks: e10s-perf
Depends on: 1224374
Assignee: nobody → gkrizsanits
I'm trying to grasp what gecko_hangs_per_minute really means for us. Given that the whole characteristic of these data changed with e10s I'm not fully convinced that this is a bad thing.

In the latest version this number went up from 20 to 26 (non-e10s vs e10s) per minute so I assume a 100ms jank is not a disaster. So instead of the counts of these janks their characteristic / distribution would be more interesting. If that's 20 200ms janks vs 26 102ms janks this can be even a win...

Vladan, could we plot these janks maybe? Or get their average in both cases at least? Without that I don't think this is a fair comparison.

Another thing I noticed that in MessagePumpDefault::Run we don't call notifyWait on the hangMonitor when we break. This should only matter when the main thread is over I think but still seem like some junk data... (http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/message_pump_default.cc#21 ) Or is this intentional?
Flags: needinfo?(vladan.bugzilla)
poiru is preparing a plot of the activity histogram of the Gecko thread faceted by e10s status, which will answer the first part of Gabor's question.
Flags: needinfo?(birunthan)
Gabor: The BHR statistic refers to the median *rate* of jank events that last 100ms or *longer*. Poiru is plotting them. Note that we may actually be under-reporting the jank rate as a result of a client-side bug (bug 1228437) and the analyses need to be updated too
Depends on: 1228437
Flags: needinfo?(vladan.bugzilla)
ni'ing myself to look into BHR behaviour in message_pump_default.cc
Flags: needinfo?(vladan.bugzilla)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #13)
> Gabor: The BHR statistic refers to the median *rate* of jank events that
> last 100ms or *longer*. Poiru is plotting them. Note that we may actually be
> under-reporting the jank rate as a result of a client-side bug (bug 1228437)
> and the analyses need to be updated too

Thanks, I got that part, it's just that I found that info not sufficient to make a fair conclusion. That's why I asked for the plotting / distribution. About the second part, isn't that because we use MessagePumpForChildProcess in the child instead of MessagePumpDefault that does the BHR?
The thread stack activity histograms can be viewed here: http://nbviewer.ipython.org/github/vitillo/e10s_analyses/blob/master/aurora/e10s_top_hang_stacks.ipynb

Note that the histograms are for the parent process only because of bug 1228437.
Flags: needinfo?(birunthan)
(In reply to Birunthan Mohanathas [:poiru] from comment #16)
> The thread stack activity histograms can be viewed here:
> http://nbviewer.ipython.org/github/vitillo/e10s_analyses/blob/master/aurora/
> e10s_top_hang_stacks.ipynb

Million thanks, this is great stuff. So one thing that is new in e10s is 
- 7.3378% (0.0041%):
  IPDL::PCookieService::RecvGetCookieString

Blake, I think you've worked on this stuff, is there a chance that instead of pull scheme we use push scheme for cookie strings? So we would have the data in every process and when something change we broadcast it. It's probably not easy if possible at all to get it right... but would help this issue probably.

The other new big player is a bit more tricky:
- 33.5110% (18.4398%):
  Startup::XRE_Main

This pseudo stack is not very descriptive, it can be really anything, and very likely multiple things... I guess we need some new tags for e10s.
Flags: needinfo?(mrbkap)
Yeah, we probably can do that. I also have bug 1185734 where this was mentioned.
Flags: needinfo?(mrbkap)
Depends on: 1232638
Depends on: 1232641
No longer blocks: 1224276
Blocks: e10s-perf
Priority: -- → P1
Clearing needinfo, this is bug 1228437 now
Flags: needinfo?(vladan.bugzilla)
Summary: [e10s][telemetry] e10s is jankier than non-e10s ('gecko_hangs_per_minute', aurora experiment) → [meta][e10s][telemetry] e10s is jankier than non-e10s ('gecko_hangs_per_minute', Telemetry experiments)
Depends on: 1235910
Based on Bill's measurements in Bug 1235910 Comment 3 this bug seems to be invalid. (after re-evaluating his statistics over the pings from the last few days I've got similar results).

Should we wait for Bug 1228437 and re-evaluate the whole thing, or should we close this bug for now since with e10s we seem to have way less janks than without?
Flags: needinfo?(birunthan)
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #20)

> Should we wait for Bug 1228437 and re-evaluate the whole thing, or should we
> close this bug for now since with e10s we seem to have way less janks than
> without?

Let's wait for Bug 1228437 and re-evaluate the metric in a new experiment.
Flags: needinfo?(birunthan)
Looking at the overall distribution of hangs [1] over 100 ms reveals that the e10s population has more hangs than the non-e10s one, even without Bug 1228437.

The log-scaled distribution of the *number of hangs* per user [2] appears to have a bimodal shape for the e10s population, i.e. some users have less hangs per session than non-e10s users and some others have way more hangs than non-e10s ones.

[1] https://gist.github.com/vitillo/8ac5eb3aec8f45464fc7
[2] https://gist.github.com/vitillo/718bf5f8d48d8a5d452a
Thanks for the measurements!

(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #22)
> Looking at the overall distribution of hangs [1] over 100 ms reveals that
> the e10s population has more hangs than the non-e10s one, even without Bug
> 1228437.
> 

I'm sure you're doing the right thing just trying to understand this data.
Previously we've always normalized the hangs with the uptime:

uptime = ping["payload"]["simpleMeasurements"].get("uptime", -1)
...
ping["gecko_hangs_per_minute"] = over100/uptime

to compare the hangs per minutes instead of the less interesting hangs per
sessions data (longer sessions -> more hangs I guess). Am I missing something
or you're not doing that any longer? If so can you explain me why?

> The log-scaled distribution of the *number of hangs* per user [2] appears to
> have a bimodal shape for the e10s population, i.e. some users have less
> hangs per session than non-e10s users and some others have way more hangs
> than non-e10s ones.
> 
> [1] https://gist.github.com/vitillo/8ac5eb3aec8f45464fc7
> [2] https://gist.github.com/vitillo/718bf5f8d48d8a5d452a

This is interesting... Is it possible to get some more data about the characteristics
of each group? Or investigating the environment of some users sampled randomly from each
group?
Flags: needinfo?(rvitillo)
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #23)
> Thanks for the measurements!
> 
> (In reply to Roberto Agostino Vitillo (:rvitillo) from comment #22)
> > Looking at the overall distribution of hangs [1] over 100 ms reveals that
> > the e10s population has more hangs than the non-e10s one, even without Bug
> > 1228437.
> > 
> 
> I'm sure you're doing the right thing just trying to understand this data.
> Previously we've always normalized the hangs with the uptime:
> 
> uptime = ping["payload"]["simpleMeasurements"].get("uptime", -1)
> ...
> ping["gecko_hangs_per_minute"] = over100/uptime
> 
> to compare the hangs per minutes instead of the less interesting hangs per
> sessions data (longer sessions -> more hangs I guess). Am I missing something
> or you're not doing that any longer? If so can you explain me why?

The notebook where the normalization is applied was written originally for an analysis on non experimental data, so we couldn't assume that the populatations were biased in some way wrt session duration.

The plots I have posted here are based on a A/B experiment, which means that there should be an equal amount of long sessions in both populations (unless e10s impacts the session duration which it doesn't) and as such it's reasonable to compare just the raw counts. Put in another way, if we were to run an A/A experiment on Beta we wouldn't expect those plots to differ significantly (given a big enough sample).

> > The log-scaled distribution of the *number of hangs* per user [2] appears to
> > have a bimodal shape for the e10s population, i.e. some users have less
> > hangs per session than non-e10s users and some others have way more hangs
> > than non-e10s ones.
> > 
> > [1] https://gist.github.com/vitillo/8ac5eb3aec8f45464fc7
> > [2] https://gist.github.com/vitillo/718bf5f8d48d8a5d452a
> 
> This is interesting... Is it possible to get some more data about the
> characteristics
> of each group? Or investigating the environment of some users sampled
> randomly from each
> group?

I am looking into it.
Flags: needinfo?(rvitillo)
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #24)
> an analysis on non experimental data, so we couldn't assume that the
> populatations were biased in some way wrt session duration.

Oh, right! Thanks for explaining, it didn't occurred to me :)
distribution of hangs with users who do not have addons installed - 
https://gist.github.com/vitillo/9b3e30620000dc3779d3
Perhaps I'm misreading, but all these notebooks seem to look at builds starting with buildid 20151228134903. As explained in https://bugzilla.mozilla.org/show_bug.cgi?id=1234618#c14, the fix for bug 1234618 didn't actually land on beta until 12/31. I don't know what the first buildid is that had the fix, but it's clearly after 12/31. So all this analysis invalid. The bimodal behavior probably reflects the fact that some builds you're looking at have the fix and others don't.
I thought the patch landed in b4 given https://bugzilla.mozilla.org/show_bug.cgi?id=1235910#c5 but it actually landed in b6 (build-id 20160104162232). Unfortunately there are not enough users left in the experiment that have a build-id >= 20160104162232 and we can't make any statistical significant claims.
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #28)
> I thought the patch landed in b4 given
> https://bugzilla.mozilla.org/show_bug.cgi?id=1235910#c5 but it actually
> landed in b6 (build-id 20160104162232). Unfortunately there are not enough
> users left in the experiment that have a build-id >= 20160104162232 and we
> can't make any statistical significant claims.

https://gist.github.com/vitillo/9b3e30620000dc3779d3

This could be good news if the users who shifted to the left had the fix. I guess we'll find out in 45.
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #28)
> can't make any statistical significant claims.

With the uptime normalization I would expect a lot faster convergence, have you tried that statistic too by any chance?
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #30)
> (In reply to Roberto Agostino Vitillo (:rvitillo) from comment #28)
> > can't make any statistical significant claims.
> 
> With the uptime normalization I would expect a lot faster convergence, have
> you tried that statistic too by any chance?

No, but there are only a few dozen users left so we are going to have to wait for the next experiment.
Whiteboard: [webvr]
Here's the updated analysis running against preliminary Beta 45 numbers: https://gist.github.com/chutten/aa269205a452be7fd89b

It's looking good, I think. Roberto, any comment?
Flags: needinfo?(rvitillo)
(In reply to Chris H-C :chutten from comment #32)
> Here's the updated analysis running against preliminary Beta 45 numbers:
> https://gist.github.com/chutten/aa269205a452be7fd89b
> 
> It's looking good, I think. Roberto, any comment?

Please submit a PR for e10s-analyses so we can review it there. The analysis is just comparing the parent activity so you are going to have to add the child activity as well for e10s since we have that data now.
Flags: needinfo?(rvitillo)
This bug is only about janks in the parent process, since that's what we consider as a possible blocker for e10s. For the child side janks we should file another bug. I'm closing this for now as we agreed on the last e10s meeting, if parent side janks turn out to be an issue again we can reopen it. Flagging as invalid since as it turned out the bug was in the measurements.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INVALID
You need to log in before you can comment on or make changes to this bug.