Closed
Bug 1182637
Opened 9 years ago
Closed 9 years ago
[meta][e10s][telemetry] e10s is jankier than non-e10s ('gecko_hangs_per_minute', Telemetry experiments)
Categories
(Firefox :: General, defect, P1)
Tracking
RESOLVED INVALID
Tracking | Status
---|---
e10s | +
People
(Reporter: vladan, Assigned: gkrizsanits)
References
(Blocks 1 open bug)
Details
(Whiteboard: [webvr])
According to Telemetry, the frequency of jank events lasting more than 100ms increased from ~19 events/min to ~21 events/minute with e10s. This was derived from the main-thread's event processing times and session uptime:
"gecko_hangs_per_minute" in http://nbviewer.ipython.org/urls/gist.githubusercontent.com/vitillo/cb6f1304316c1c1a2cbc/raw/e10s%20analysis.ipynb
^ Based on Telemetry collected from Nightly 41 on June 15th, from buildIDs in the range [20150601, 20150616]
Full e10s performance write-up: https://groups.google.com/d/msg/mozilla.dev.platform/elb0Au7qwG0/Yarsx1xrrnwJ
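The "gecko_hangs_per_minute" metric divides the count of main-thread hang events at or above a threshold by session uptime. As a rough illustrative sketch of that idea (not the notebook's actual code — the `hangDurations` field and its layout here are simplified assumptions; only `simpleMeasurements.uptime` matches the real ping schema):

```python
def hangs_per_minute(ping, threshold_ms=100):
    """Count main-thread hang events lasting threshold_ms or longer,
    normalized by session uptime in minutes (the 'gecko_hangs_per_minute'
    idea; the ping layout is simplified for illustration)."""
    uptime = ping["payload"]["simpleMeasurements"].get("uptime", -1)
    if uptime <= 0:
        return None  # discard pings with missing or invalid uptime
    # Assumed shape: {duration_ms: event_count}; the real data comes from
    # BackgroundHangMonitor histograms with log-spaced buckets.
    hangs = ping["payload"].get("hangDurations", {})
    over = sum(n for ms, n in hangs.items() if ms >= threshold_ms)
    return over / uptime
```

For example, a ping with 20 qualifying hangs over a 10-minute session yields 2.0 hangs per minute.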
Updated•9 years ago
tracking-e10s: --- → ?
Updated•9 years ago
Comment 1•9 years ago
Vladan, one theory is that the cause is addons over-using CPOWs. Is there any way you can either prove or disprove that theory with data?
Flags: needinfo?(vdjeric)
Reporter
Comment 2•9 years ago
(In reply to Brad Lassey [:blassey] (use needinfo?) from comment #1)
> Vladan, one theory is that the cause is addons over-using CPOWs. Is there
> any way you can either prove or disprove that theory with data?
Yes, we could re-run the analysis on reports from profiles with no addons. Roberto, would you have time to do this sometime in the next couple of weeks?
Flags: needinfo?(vdjeric) → needinfo?(rvitillo)
Reporter
Comment 3•9 years ago
We also have the MISBEHAVING_ADDONS_CPOW_TIME_MS, MISBEHAVING_ADDONS_JANK_LEVEL and ADDON_SHIM_USAGE histograms, which we could use as controls.
Comment 4•9 years ago
We don't have enough non-e10s users without add-ons to make any strong claims. Soon we are probably not going to have enough users to compare e10s and non-e10s builds at all.
We should make sure that at least one quarter of Nightly users still run non-e10s builds.
Flags: needinfo?(rvitillo)
Reporter
Comment 5•9 years ago
Anthony: Can you re-run Roberto's gecko_hangs_per_minute comparison of e10s vs non-e10s responsiveness, on Aurora 41 using buildIDs from before July 29th, and profiles with no extensions except the default Firefox theme extension?
Roberto's analysis on Nightly: http://nbviewer.ipython.org/urls/gist.githubusercontent.com/vitillo/cb6f1304316c1c1a2cbc/raw/e10s%20analysis.ipynb
Flags: needinfo?(azhang)
Comment 6•9 years ago
Done: https://gist.github.com/Uberi/40458dbccd88388739f7
This one is based on rvitillo's no addons e10s analysis [1]. It uses data from Aurora 41 users with buildIDs from July 28th, 2015, who have no extensions installed (except the default theme extension).
[1]: https://gist.github.com/vitillo/8b6af93c6741e2278807
Flags: needinfo?(azhang)
Comment 7•9 years ago
Anthony, why is the "gecko_hangs_per_minute" comparison missing from your notebook?
Flags: needinfo?(azhang)
Comment 8•9 years ago
Missed that while checking over the notebook - the link above is now updated with all the info.
Flags: needinfo?(azhang)
Reporter
Updated•9 years ago
Blocks: e10s-measurement
Updated•9 years ago
Summary: Telemetry: e10s is jankier than non-e10s → [e10s][telemetry] e10s is jankier than non-e10s
Comment 9•9 years ago
This data comes from here -
http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/BackgroundHangMonitor.cpp
the current aurora experiment data is here -
https://github.com/vitillo/e10s_analyses/blob/master/aurora/e10s_experiment.ipynb
search for 'gecko_hangs_per_minute'. This includes a random mixture of users with addons installed.
Results:
Median difference in hangs over 100ms per minute is 8.90, (31.18, 22.28).
The probability of this effect being purely by chance is 0.00.
This is pretty bad. That the chrome thread would see this increase in jank is surprising, considering how well we're doing on other performance stats. For starters, we might want to look into the background hang monitoring code to be sure it's working right with e10s.
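The quoted result reports a median difference in hangs per minute together with "the probability of this effect being purely by chance". The notebook computes its own statistics; purely as an illustration of how such a chance probability can be estimated, here is a generic two-sided permutation test on the difference of medians (made-up data, not the experiment's method or numbers):

```python
import random

def perm_test_median_diff(a, b, n_iter=2000, seed=0):
    """Estimate the probability that the observed difference in medians
    between samples a and b arises purely by chance, via a two-sided
    permutation test (illustrative sketch, not the notebook's method)."""
    rng = random.Random(seed)

    def median(xs):
        s = sorted(xs)
        n = len(s)
        return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

    observed = abs(median(a) - median(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        # Shuffle the pooled samples and re-split into two groups of the
        # original sizes; count how often chance alone matches the effect.
        rng.shuffle(pooled)
        d = abs(median(pooled[:len(a)]) - median(pooled[len(a):]))
        if d >= observed:
            hits += 1
    return hits / n_iter
```

Well-separated samples yield a probability near 0; identical samples yield 1.0.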
Updated•9 years ago
Summary: [e10s][telemetry] e10s is jankier than non-e10s → [e10s][telemetry] e10s is jankier than non-e10s (aurora experiment w/addons)
Updated•9 years ago
Comment 10•9 years ago
for clients without any extensions:
https://github.com/vitillo/e10s_analyses/blob/master/aurora/addons/e10s_noext_experiment.ipynb
for clients with at least one extension:
https://github.com/vitillo/e10s_analyses/blob/master/aurora/addons/e10s_onlyext_experiment.ipynb
for clients with just adblock plus installed:
https://github.com/vitillo/e10s_analyses/blob/master/aurora/addons/e10s_top_extension_experiment.ipynb
Summary: [e10s][telemetry] e10s is jankier than non-e10s (aurora experiment w/addons) → [e10s][telemetry] e10s is jankier than non-e10s ('gecko_hangs_per_minute', aurora experiment)
Updated•9 years ago
Assignee
Updated•9 years ago
Assignee: nobody → gkrizsanits
Assignee
Comment 11•9 years ago
I'm trying to grasp what gecko_hangs_per_minute really means for us. Given that the whole character of this data changed with e10s, I'm not fully convinced that this is a bad thing.
In the latest version this number went up from 20 to 26 per minute (non-e10s vs e10s), so I assume a single 100ms jank is not a disaster. Rather than the counts of these janks, their distribution would be more interesting. If it's 20 janks of 200ms versus 26 janks of 102ms, this could even be a win...
Vladan, could we plot these janks maybe? Or get their average in both cases at least? Without that I don't think this is a fair comparison.
Another thing I noticed is that in MessagePumpDefault::Run we don't call notifyWait on the hangMonitor when we break. This should only matter when the main thread is finishing, I think, but it still seems like it could produce some junk data... (http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/message_pump_default.cc#21) Or is this intentional?
Flags: needinfo?(vladan.bugzilla)
Comment 12•9 years ago
poiru is preparing a plot of the activity histogram of the Gecko thread faceted by e10s status, which will answer the first part of Gabor's question.
Flags: needinfo?(birunthan)
Reporter
Comment 13•9 years ago
Gabor: The BHR statistic refers to the median *rate* of jank events that last 100ms or *longer*. Poiru is plotting them. Note that we may actually be under-reporting the jank rate as a result of a client-side bug (bug 1228437), and the analyses need to be updated too.
Depends on: 1228437
Flags: needinfo?(vladan.bugzilla)
Reporter
Comment 14•9 years ago
ni'ing myself to look into BHR behaviour in message_pump_default.cc
Flags: needinfo?(vladan.bugzilla)
Assignee
Comment 15•9 years ago
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #13)
> Gabor: The BHR statistic refers to the median *rate* of jank events that
> last 100ms or *longer*. Poiru is plotting them. Note that we may actually be
> under-reporting the jank rate as a result of a client-side bug (bug 1228437)
> and the analyses need to be updated too
Thanks, I got that part; it's just that I found that information insufficient to draw a fair conclusion. That's why I asked for the plotting / distribution. About the second part, isn't that because in the child we use MessagePumpForChildProcess instead of MessagePumpDefault, which does the BHR?
Comment 16•9 years ago
The thread stack activity histograms can be viewed here: http://nbviewer.ipython.org/github/vitillo/e10s_analyses/blob/master/aurora/e10s_top_hang_stacks.ipynb
Note that the histograms are for the parent process only because of bug 1228437.
Flags: needinfo?(birunthan)
Assignee
Comment 17•9 years ago
(In reply to Birunthan Mohanathas [:poiru] from comment #16)
> The thread stack activity histograms can be viewed here:
> http://nbviewer.ipython.org/github/vitillo/e10s_analyses/blob/master/aurora/
> e10s_top_hang_stacks.ipynb
Million thanks, this is great stuff. So one thing that is new in e10s is:
- 7.3378% (0.0041%):
IPDL::PCookieService::RecvGetCookieString
Blake, I think you've worked on this stuff; is there a chance we could use a push scheme instead of a pull scheme for cookie strings? Then we would have the data in every process, and when something changes we broadcast it. It's probably not easy to get right, if it's possible at all... but it would probably help with this issue.
The other new big player is a bit more tricky:
- 33.5110% (18.4398%):
Startup::XRE_Main
This pseudo-stack is not very descriptive; it could really be anything, and very likely multiple things... I guess we need some new tags for e10s.
Flags: needinfo?(mrbkap)
Comment 18•9 years ago
Yeah, we probably can do that. I also have bug 1185734 where this was mentioned.
Flags: needinfo?(mrbkap)
Reporter
Comment 19•9 years ago
Clearing needinfo, this is bug 1228437 now
Flags: needinfo?(vladan.bugzilla)
Reporter
Updated•9 years ago
Summary: [e10s][telemetry] e10s is jankier than non-e10s ('gecko_hangs_per_minute', aurora experiment) → [meta][e10s][telemetry] e10s is jankier than non-e10s ('gecko_hangs_per_minute', Telemetry experiments)
Assignee
Comment 20•9 years ago
Based on Bill's measurements in Bug 1235910 Comment 3, this bug seems to be invalid (after re-evaluating his statistics over the pings from the last few days, I got similar results).
Should we wait for Bug 1228437 and re-evaluate the whole thing, or should we close this bug for now, since with e10s we seem to have far fewer janks than without?
Flags: needinfo?(birunthan)
Comment 21•9 years ago
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #20)
> Should we wait for Bug 1228437 and re-evaluate the whole thing, or should we
> close this bug for now since with e10s we seem to have far fewer janks than
> without?
Let's wait for Bug 1228437 and re-evaluate the metric in a new experiment.
Flags: needinfo?(birunthan)
Comment 22•9 years ago
Looking at the overall distribution of hangs [1] over 100 ms reveals that the e10s population has more hangs than the non-e10s one, even without Bug 1228437.
The log-scaled distribution of the *number of hangs* per user [2] appears to have a bimodal shape for the e10s population, i.e. some users have fewer hangs per session than non-e10s users and some others have far more hangs than non-e10s ones.
[1] https://gist.github.com/vitillo/8ac5eb3aec8f45464fc7
[2] https://gist.github.com/vitillo/718bf5f8d48d8a5d452a
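A quick way to spot a bimodal shape like the one described above, without plotting, is to bucket the per-user hang counts into logarithmic bins and look for two separated peaks. This helper is purely illustrative and not taken from the linked gists:

```python
import math
from collections import Counter

def log_binned_counts(hangs_per_user, base=2):
    """Bucket per-user hang counts into power-of-`base` bins.

    Returns {bin_floor: user_count}; two separated peaks in the result
    suggest a bimodal distribution of hangs per user.
    """
    bins = Counter()
    for n in hangs_per_user:
        if n <= 0:
            bins[0] += 1  # users with no hangs get their own bin
        else:
            # Floor each count to the nearest lower power of `base`.
            bins[base ** int(math.log(n, base))] += 1
    return dict(sorted(bins.items()))
```

For example, counts of 2 and 3 land in the same bin (floor 2), while 5 lands in bin 4 and 17 in bin 16.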
Assignee
Comment 23•9 years ago
Thanks for the measurements!
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #22)
> Looking at the overall distribution of hangs [1] over 100 ms reveals that
> the e10s population has more hangs than the non-e10s one, even without Bug
> 1228437.
>
I'm sure you're doing the right thing; I'm just trying to understand this data.
Previously we've always normalized the hangs with the uptime:
uptime = ping["payload"]["simpleMeasurements"].get("uptime", -1)
...
ping["gecko_hangs_per_minute"] = over100/uptime
to compare hangs per minute instead of the less interesting hangs-per-session data (longer sessions -> more hangs, I guess). Am I missing something, or are you not doing that any longer? If so, can you explain why?
> The log-scaled distribution of the *number of hangs* per user [2] appears to
> have a bimodal shape for the e10s population, i.e. some users have less
> hangs per session than non-e10s users and some others have way more hangs
> than non-e10s ones.
>
> [1] https://gist.github.com/vitillo/8ac5eb3aec8f45464fc7
> [2] https://gist.github.com/vitillo/718bf5f8d48d8a5d452a
This is interesting... Is it possible to get some more data about the characteristics of each group? Or to investigate the environment of some users sampled randomly from each group?
Flags: needinfo?(rvitillo)
Comment 24•9 years ago
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #23)
> Thanks for the measurements!
>
> (In reply to Roberto Agostino Vitillo (:rvitillo) from comment #22)
> > Looking at the overall distribution of hangs [1] over 100 ms reveals that
> > the e10s population has more hangs than the non-e10s one, even without Bug
> > 1228437.
> >
>
> I'm sure you're doing the right thing just trying to understand this data.
> Previously we've always normalized the hangs with the uptime:
>
> uptime = ping["payload"]["simpleMeasurements"].get("uptime", -1)
> ...
> ping["gecko_hangs_per_minute"] = over100/uptime
>
> to compare the hangs per minutes instead of the less interesting hangs per
> sessions data (longer sessions -> more hangs I guess). Am I missing something
> or you're not doing that any longer? If so can you explain me why?
The notebook where the normalization is applied was originally written for an analysis on non-experimental data, so we couldn't assume that the populations were unbiased wrt session duration.
The plots I have posted here are based on an A/B experiment, which means that there should be an equal number of long sessions in both populations (unless e10s impacts the session duration, which it doesn't), and as such it's reasonable to compare just the raw counts. Put another way, if we were to run an A/A experiment on Beta, we wouldn't expect those plots to differ significantly (given a big enough sample).
> > The log-scaled distribution of the *number of hangs* per user [2] appears to
> > have a bimodal shape for the e10s population, i.e. some users have less
> > hangs per session than non-e10s users and some others have way more hangs
> > than non-e10s ones.
> >
> > [1] https://gist.github.com/vitillo/8ac5eb3aec8f45464fc7
> > [2] https://gist.github.com/vitillo/718bf5f8d48d8a5d452a
>
> This is interesting... Is it possible to get some more data about the
> characteristics
> of each group? Or investigating the environment of some users sampled
> randomly from each
> group?
I am looking into it.
Flags: needinfo?(rvitillo)
Assignee
Comment 25•9 years ago
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #24)
> an analysis on non-experimental data, so we couldn't assume that the
> populations weren't biased in some way wrt session duration.
Oh, right! Thanks for explaining, it hadn't occurred to me :)
Comment 26•9 years ago
distribution of hangs with users who do not have addons installed -
https://gist.github.com/vitillo/9b3e30620000dc3779d3
Comment 27•9 years ago
Perhaps I'm misreading, but all these notebooks seem to look at builds starting with buildid 20151228134903. As explained in https://bugzilla.mozilla.org/show_bug.cgi?id=1234618#c14, the fix for bug 1234618 didn't actually land on beta until 12/31. I don't know what the first buildid is that had the fix, but it's clearly after 12/31. So all this analysis is invalid. The bimodal behavior probably reflects the fact that some builds you're looking at have the fix and others don't.
Comment 28•9 years ago
I thought the patch had landed in b4 given https://bugzilla.mozilla.org/show_bug.cgi?id=1235910#c5, but it actually landed in b6 (build-id 20160104162232). Unfortunately there are not enough users left in the experiment that have a build-id >= 20160104162232, and we can't make any statistically significant claims.
Comment 29•9 years ago
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #28)
> I thought the patch landed in b4 given
> https://bugzilla.mozilla.org/show_bug.cgi?id=1235910#c5 but it actually
> landed in b6 (build-id 20160104162232). Unfortunately there are not enough
> users left in the experiment that have a build-id >= 20160104162232 and we
> can't make any statistically significant claims.
https://gist.github.com/vitillo/9b3e30620000dc3779d3
This could be good news if the users who shifted to the left had the fix. I guess we'll find out in 45.
Assignee
Comment 30•9 years ago
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #28)
> can't make any statistically significant claims.
With the uptime normalization I would expect much faster convergence; have you tried that statistic too, by any chance?
Comment 31•9 years ago
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #30)
> (In reply to Roberto Agostino Vitillo (:rvitillo) from comment #28)
> > can't make any statistically significant claims.
>
> With the uptime normalization I would expect a lot faster convergence, have
> you tried that statistic too by any chance?
No, but there are only a few dozen users left, so we are going to have to wait for the next experiment.
Updated•9 years ago
Blocks: e10s-responsiveness
Updated•9 years ago
Whiteboard: [webvr]
Comment 32•9 years ago
Here's the updated analysis running against preliminary Beta 45 numbers: https://gist.github.com/chutten/aa269205a452be7fd89b
It's looking good, I think. Roberto, any comment?
Flags: needinfo?(rvitillo)
Comment 33•9 years ago
(In reply to Chris H-C :chutten from comment #32)
> Here's the updated analysis running against preliminary Beta 45 numbers:
> https://gist.github.com/chutten/aa269205a452be7fd89b
>
> It's looking good, I think. Roberto, any comment?
Please submit a PR for e10s-analyses so we can review it there. The analysis is only comparing parent activity, so you are going to have to add the child activity as well for e10s, since we have that data now.
Flags: needinfo?(rvitillo)
Assignee
Comment 34•9 years ago
This bug is only about janks in the parent process, since that's what we consider a possible blocker for e10s. For child-side janks we should file another bug. I'm closing this for now, as we agreed at the last e10s meeting; if parent-side janks turn out to be an issue again, we can reopen it. Flagging as invalid since, as it turned out, the bug was in the measurements.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INVALID